PREFACE



Signal processing theory plays an increasingly central role in the 
development of modern telecommunication and information processing 
systems, and has a wide range of applications in multimedia technology, 
audio-visual signal processing, cellular mobile communication, adaptive 
network management, radar systems, pattern analysis, medical signal 
processing, financial data forecasting, decision making systems, etc. The 
theory and application of signal processing is concerned with the 
identification, modelling and utilisation of patterns and structures in a 
signal process. The observation signals are often distorted, incomplete and
noisy. Hence, noise reduction and the removal of channel distortion are
important parts of a signal processing system. The aim of this book is to
provide a coherent and structured presentation of the theory and 
applications of statistical signal processing and noise reduction methods. 

This book is organised in 15 chapters. 

Chapter 1 begins with an introduction to signal processing, and 
provides a brief review of signal processing methodologies and 
applications. The basic operations of sampling and quantisation are 
reviewed in this chapter. 

Chapter 2 provides an introduction to noise and distortion. Several 
different types of noise, including thermal noise, shot noise, acoustic noise, 
electromagnetic noise and channel distortions, are considered. The chapter 
concludes with an introduction to the modelling of noise processes. 

Chapter 3 provides an introduction to the theory and applications of 
probability models and stochastic signal processing. The chapter begins 
with an introduction to random signals, stochastic processes, probabilistic 
models and statistical measures. The concepts of stationary, non-stationary 
and ergodic processes are introduced in this chapter, and some important 
classes of random processes, such as Gaussian, mixture Gaussian, Markov 
chains and Poisson processes, are considered. The effects of transformation 
of a signal on its statistical distribution are considered. 

Chapter 4 is on Bayesian estimation and classification. In this chapter 
the estimation problem is formulated within the general framework of 
Bayesian inference. The chapter includes Bayesian theory, classical 
estimators, the estimate-maximise method, the Cramer-Rao bound on the 
minimum-variance estimate, Bayesian classification, and the modelling of 
the space of a random signal. This chapter provides a number of examples 
on Bayesian estimation of signals observed in noise. 

Chapter 5 considers hidden Markov models (HMMs) for non-stationary
signals. The chapter begins with an introduction to the modelling
of non-stationary signals and then concentrates on the theory and 
applications of hidden Markov models. The hidden Markov model is 
introduced as a Bayesian model, and methods of training HMMs and using 
them for decoding and classification are considered. The chapter also 
includes the application of HMMs in noise reduction. 

Chapter 6 considers Wiener filters. The least square error filter is
formulated first through minimisation of the expectation of the squared 
error function over the space of the error signal. Then a block-signal 
formulation of Wiener filters and a vector space interpretation of Wiener 
filters are considered. The frequency response of the Wiener filter is 
derived through minimisation of mean square error in the frequency 
domain. Some applications of the Wiener filter are considered, and a case 
study of the Wiener filter for removal of additive noise provides useful 
insight into the operation of the filter. 

Chapter 7 considers adaptive filters. The chapter begins with the state- 
space equation for Kalman filters. The optimal filter coefficients are 
derived using the principle of orthogonality of the innovation signal. The 
recursive least squares (RLS) filter, which is an exact sample-adaptive
implementation of the Wiener filter, is derived in this chapter. Then the 
steepest-descent search method for the optimal filter is introduced. The 
chapter concludes with a study of the LMS adaptive filters. 

Chapter 8 considers linear prediction and sub-band linear prediction 
models. Forward prediction, backward prediction and lattice predictors are 
studied. This chapter introduces a modified predictor for the modelling of 
the short-term and the pitch period correlation structures. A maximum a 
posteriori (MAP) estimate of a predictor model that includes the prior 
probability density function of the predictor is introduced. This chapter 
concludes with the application of linear prediction in signal restoration. 

Chapter 9 considers frequency analysis and power spectrum estimation. 
The chapter begins with an introduction to the Fourier transform, and the 
role of the power spectrum in identification of patterns and structures in a 
signal process. The chapter considers non-parametric spectral estimation, 
model-based spectral estimation, the maximum entropy method, and high- 
resolution spectral estimation based on eigenanalysis. 

Chapter 10 considers interpolation of a sequence of unknown samples. 
This chapter begins with a study of the ideal interpolation of a band-limited 
signal, a simple model for the effects of a number of missing samples, and 
the factors that affect interpolation. Interpolators are divided into two 
categories: polynomial and statistical interpolators. A general form of
polynomial interpolation as well as its special forms (Lagrange, Newton, 
Hermite and cubic spline interpolators) are considered. Statistical 
interpolators in this chapter include maximum a posteriori interpolation, 
least squared error interpolation based on an autoregressive model, 
time-frequency interpolation, and interpolation through search of an 
adaptive codebook for the best signal. 

Chapter 11 considers spectral subtraction. A general form of spectral 
subtraction is formulated and the processing distortions that result from
spectral subtraction are considered. The effects of processing distortions on
the distribution of a signal are illustrated. The chapter considers methods 
for removal of the distortions and also non-linear methods of spectral 
subtraction. This chapter concludes with an implementation of spectral 
subtraction for signal restoration. 

Chapters 12 and 13 cover the modelling, detection and removal of 
impulsive noise and transient noise pulses. In Chapter 12, impulsive noise 
is modelled as a binary-state non-stationary process and several stochastic
models for impulsive noise are considered. For removal of impulsive noise, 
median filters and a method based on a linear prediction model of the signal 
process are considered. The materials in Chapter 13 closely follow Chapter 
12. In Chapter 13, a template-based method, an HMM-based method and an 
AR model-based method for removal of transient noise are considered. 

Chapter 14 covers echo cancellation. The chapter begins with an 
introduction to telephone line echoes, and considers line echo suppression 
and adaptive line echo cancellation. Then the problems of acoustic echoes
and acoustic coupling between loudspeaker and microphone systems are
considered. The chapter concludes with a study of a sub-band echo
cancellation system.

Chapter 15 is on blind deconvolution and channel equalisation. This 
chapter begins with an introduction to channel distortion models and the 
ideal channel equaliser. Then the Wiener equaliser, blind equalisation using 
the channel input power spectrum, blind deconvolution based on linear 
predictive models, Bayesian channel equalisation, and blind equalisation 
for digital communication channels are considered. The chapter concludes 
with equalisation of maximum phase channels using higher-order statistics. 



Saeed Vaseghi 
June 2000 
1.1 Signals and Information

1.2 Signal Processing Methods 

1.3 Applications of Digital Signal Processing 

1.4 Sampling and Analog-to-Digital Conversion 



Signal processing is concerned with the modelling, detection,
identification and utilisation of patterns and structures in a signal
process. Applications of signal processing methods include audio hi-fi,
digital TV and radio, cellular mobile phones, voice recognition, vision,
radar, sonar, geophysical exploration, medical electronics, and in general 
any system that is concerned with the communication or processing of 
information. Signal processing theory plays a central role in the 
development of digital telecommunication and automation systems, and in 
efficient and optimal transmission, reception and decoding of information. 
Statistical signal processing theory provides the foundations for modelling 
the distribution of random signals and the environments in which the signals 
propagate. Statistical models are applied in signal processing, and in 
decision-making systems, for extracting information from a signal that may 
be noisy, distorted or incomplete. This chapter begins with a definition of 
signals, and a brief introduction to various signal processing methodologies. 
We consider several key applications of digital signal processing in adaptive 
noise reduction, channel equalisation, pattern classification/recognition, 
audio signal coding, signal detection, spatial processing for directional 
reception of signals, Dolby noise reduction and radar. The chapter concludes 
with an introduction to sampling and conversion of continuous-time signals 
to digital signals. 
1.1 Signals and Information

A signal can be defined as the variation of a quantity by which information 
is conveyed regarding the state, the characteristics, the composition, the 
trajectory, the course of action or the intention of the signal source. A signal 
is a means to convey information. The information conveyed in a signal may 
be used by humans or machines for communication, forecasting, decision- 
making, control, exploration etc. Figure 1.1 illustrates an information source 
followed by a system for signalling the information, a communication 
channel for propagation of the signal from the transmitter to the receiver, 
and a signal processing unit at the receiver for extraction of the information 
from the signal. In general, there is a mapping operation that maps the
information I(t) to the signal x(t) that carries the information; this mapping
function may be denoted as T[·] and expressed as

x(t) = T[I(t)]    (1.1)

For example, in human speech communication, the voice-generating 
mechanism provides a means for the talker to map each word into a distinct 
acoustic speech signal that can propagate to the listener. To communicate a 
word w, the talker generates an acoustic signal realisation of the word; this
acoustic signal x(t) may be contaminated by ambient noise and/or distorted
by a communication channel, or impaired by the speaking abnormalities of
the talker, and received as the noisy and distorted signal y(t). In addition to
conveying the spoken word, the acoustic speech signal has the capacity to
convey information on the speaking characteristic, accent and the emotional
state of the talker. The listener extracts this information by processing the
signal y(t).

In the past few decades, the theory and applications of digital signal 
processing have evolved to play a central role in the development of modern 
telecommunication and information technology systems. 

Figure 1.1 Illustration of a communication and signal processing system.

Signal processing methods are central to efficient communication, and to
the development of intelligent man/machine interfaces in such areas as
speech and visual pattern recognition for multimedia systems. In general,
digital signal processing is concerned with two broad areas of information
theory:

(a) efficient and reliable coding, transmission, reception, storage and 
representation of signals in communication systems, and 

(b) the extraction of information from noisy signals for pattern 
recognition, detection, forecasting, decision-making, signal 
enhancement, control, automation etc. 

In the next section we consider four broad approaches to signal processing 
problems. 



1.2 Signal Processing Methods 

Signal processing methods have evolved in algorithmic complexity aiming 
for optimal utilisation of the information in order to achieve the best 
performance. In general the computational requirement of signal processing 
methods increases, often exponentially, with the algorithmic complexity. 
However, the implementation cost of advanced signal processing methods 
has been offset and made affordable by the consistent trend in recent years 
of a continuing increase in the performance, coupled with a simultaneous 
decrease in the cost, of signal processing hardware. 

Depending on the method used, digital signal processing algorithms can 
be categorised into one or a combination of four broad categories. These are 
non-parametric signal processing, model-based signal processing, Bayesian 
statistical signal processing and neural networks. These methods are briefly 
described in the following. 



1.2.1 Non-parametric Signal Processing 

Non-parametric methods, as the name implies, do not utilise a parametric 
model of the signal generation or a model of the statistical distribution of the 
signal. The signal is processed as a waveform or a sequence of digits. 
Non-parametric methods are not specialised to any particular class of
signals; they are broadly applicable methods that can be applied to any
signal regardless of the characteristics or the source of the signal. The
drawback of these methods is that they do not utilise the distinct
characteristics of the signal process that may lead to substantial
improvement in performance. Some examples of non-parametric methods
include digital filtering and transform-based signal processing methods such 
as the Fourier analysis/synthesis relations and the discrete cosine transform. 
Some non-parametric methods of power spectrum estimation, interpolation 
and signal restoration are described in Chapters 9, 10 and 11. 



1.2.2 Model-Based Signal Processing 

Model-based signal processing methods utilise a parametric model of the 
signal generation process. The parametric model normally describes the 
predictable structures and the expected patterns in the signal process, and 
can be used to forecast the future values of a signal from its past trajectory. 
Model-based methods normally outperform non-parametric methods, since 
they utilise more information in the form of a model of the signal process. 
However, they can be sensitive to the deviations of a signal from the class of 
signals characterised by the model. The most widely used parametric model 
is the linear prediction model, described in Chapter 8. Linear prediction 
models have facilitated the development of advanced signal processing 
methods for a wide range of applications such as low-bit-rate speech coding 
in cellular mobile telephony, digital video coding, high-resolution spectral 
analysis, radar signal processing and speech recognition. 



1.2.3 Bayesian Statistical Signal Processing 

The fluctuations of a purely random signal, or the distribution of a class of 
random signals in the signal space, cannot be modelled by a predictive 
equation, but can be described in terms of the statistical average values, and 
modelled by a probability distribution function in a multidimensional signal 
space. For example, as described in Chapter 8, a linear prediction model 
driven by a random signal can model the acoustic realisation of a spoken 
word. However, the random input signal of the linear prediction model, or 
the variations in the characteristics of different acoustic realisations of the 
same word across the speaking population, can only be described in 
statistical terms and in terms of probability functions. Bayesian inference 
theory provides a generalised framework for statistical processing of random 
signals, and for formulating and solving estimation and decision-making 
problems. Chapter 4 describes the Bayesian inference methodology and the 
estimation of random processes observed in noise. 



1.2.4 Neural Networks

Neural networks are combinations of relatively simple non-linear adaptive 
processing units, arranged to have a structural resemblance to the 
transmission and processing of signals in biological neurons. In a neural 
network several layers of parallel processing elements are interconnected 
with a hierarchically structured connection network. The connection weights 
are trained to perform a signal processing function such as prediction or 
classification. Neural networks are particularly useful in non-linear 
partitioning of a signal space, in feature extraction and pattern recognition, 
and in decision-making systems. In some hybrid pattern recognition systems 
neural networks are used to complement Bayesian inference methods. Since 
the main objective of this book is to provide a coherent presentation of the 
theory and applications of statistical signal processing, neural networks are 
not discussed in this book. 



1.3 Applications of Digital Signal Processing 

In recent years, the development and commercial availability of increasingly 
powerful and affordable digital computers has been accompanied by the 
development of advanced digital signal processing algorithms for a wide 
variety of applications such as noise reduction, telecommunication, radar, 
sonar, video and audio signal processing, pattern recognition, geophysical
exploration, data forecasting, and the processing of large databases for the
identification, extraction and organisation of unknown underlying structures
and patterns. Figure 1.2 shows a broad categorisation of some DSP 
applications. This section provides a review of several key applications of 
digital signal processing methods. 



1.3.1 Adaptive Noise Cancellation and Noise Reduction 

In speech communication from a noisy acoustic environment such as a 
moving car or train, or over a noisy telephone channel, the speech signal is 
observed in an additive random noise. In signal measurement systems the 
information-bearing signal is often contaminated by noise from its 
surrounding environment. The noisy observation y(m) can be modelled as 



y(m) = x(m) + n(m)    (1.2)

where x(m) and n(m) are the signal and the noise, and m is the
discrete-time index. In some situations, for example when using a mobile
telephone in a moving car, or when using a radio communication device in an
aircraft cockpit, it may be possible to measure and estimate the instantaneous
amplitude of the ambient noise using a directional microphone. The signal
x(m) may then be recovered by subtraction of an estimate of the noise from
the noisy signal.

Figure 1.2 A classification of the applications of digital signal processing.
[Application groups shown: speech coding, image coding, data compression,
communication over noisy channels; signal and data communication on
adverse channels; spectral analysis, radar and sonar signal processing,
signal enhancement, geophysics exploration; speech recognition, image and
character recognition, signal detection.]

Figure 1.3 shows a two-input adaptive noise cancellation system for
enhancement of noisy speech. In this system a directional microphone takes
as input the noisy signal x(m) + n(m), and a second directional microphone,
positioned some distance away, measures the noise αn(m + τ). The
attenuation factor α and the time delay τ provide a rather over-simplified
model of the effects of propagation of the noise to different positions in the
space where the microphones are placed. The noise from the second
microphone is processed by an adaptive digital filter to make it equal to the
noise contaminating the speech signal, and then subtracted from the noisy
signal to cancel out the noise. The adaptive noise canceller is more effective
in cancelling out the low-frequency part of the noise, but generally suffers
from the non-stationary character of the signals, and from the
over-simplified assumption that a linear filter can model the diffusion and
propagation of the noise sound in the space.

Figure 1.3 Configuration of a two-microphone adaptive noise canceller.
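
As a minimal sketch of the two-microphone canceller described above, the Python fragment below adapts a short FIR filter on the reference-microphone signal and subtracts its output from the noisy signal. The LMS update rule, the names `lms_noise_canceller` and `make_test_signals`, the filter length, the step size and the synthetic test data are all assumptions made for illustration; this section does not specify the adaptation algorithm.

```python
import math
import random


def lms_noise_canceller(primary, reference, num_taps=4, step=0.01):
    """Subtract an adaptively filtered reference from the primary input.

    primary   : noisy signal samples, y(m) = x(m) + n(m)
    reference : noise-only samples from the second microphone
    Returns (enhanced signal, final filter weights).
    """
    weights = [0.0] * num_taps
    taps = [0.0] * num_taps          # most recent reference samples, newest first
    enhanced = []
    for y, r in zip(primary, reference):
        taps = [r] + taps[:-1]
        noise_estimate = sum(w * t for w, t in zip(weights, taps))
        error = y - noise_estimate   # the error doubles as the enhanced output
        # LMS update (assumed here): nudge the weights to reduce the squared error.
        weights = [w + step * error * t for w, t in zip(weights, taps)]
        enhanced.append(error)
    return enhanced, weights


def make_test_signals(num_samples=5000, alpha=0.5, tau=2, seed=0):
    """Synthetic data: a sine wave as the signal, white noise, reference alpha*n(m+tau)."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in range(num_samples + tau)]
    clean = [math.sin(0.05 * m) for m in range(num_samples)]
    primary = [clean[m] + noise[m] for m in range(num_samples)]
    reference = [alpha * noise[m + tau] for m in range(num_samples)]
    return clean, primary, reference


if __name__ == "__main__":
    clean, primary, reference = make_test_signals()
    enhanced, weights = lms_noise_canceller(primary, reference)
    print("final weights:", [round(w, 2) for w in weights])
```

With the idealised reference αn(m + τ) used in this sketch, the filter only has to undo one attenuation and one delay, so the weight at delay τ should settle near 1/α. Real acoustic paths are not this simple, which is the limitation the text points out.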

In many applications, for example at the receiver of a 
telecommunication system, there is no access to the instantaneous value of 
the contaminating noise, and only the noisy signal is available. In such cases 
the noise cannot be cancelled out, but it may be reduced, in an average 
sense, using the statistics of the signal and the noise process. Figure 1.4
shows a bank of Wiener filters for reducing additive noise when only the
noisy signal is available. The filter bank coefficients attenuate each noisy
signal frequency in inverse proportion to the signal-to-noise ratio at that
frequency. The Wiener filter bank coefficients, derived in Chapter 6, are
calculated from estimates of the power spectra of the signal and the noise
processes.

Figure 1.4 A frequency-domain Wiener filter for reducing additive noise.
[Input: noisy signal y(m) = x(m) + n(m); output: restored signal.]
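
The dependence of the filter gain on the signal-to-noise ratio can be illustrated with a short sketch. It assumes the commonly quoted form of the frequency-domain Wiener gain for additive noise that is uncorrelated with the signal, W(f) = Pxx(f) / (Pxx(f) + Pnn(f)); the function name `wiener_gains` and the example spectra are hypothetical, and the book's own derivation is the one given in Chapter 6.

```python
def wiener_gains(signal_psd, noise_psd):
    """Per-frequency gain W(f) = Pxx(f) / (Pxx(f) + Pnn(f)).

    signal_psd, noise_psd : estimated power spectra, one value per frequency bin.
    A bin with no power at all is given zero gain.
    """
    gains = []
    for pxx, pnn in zip(signal_psd, noise_psd):
        total = pxx + pnn
        gains.append(pxx / total if total > 0.0 else 0.0)
    return gains


if __name__ == "__main__":
    # Hypothetical power spectra at three bins: high SNR, unit SNR, no signal.
    print(wiener_gains([9.0, 1.0, 0.0], [1.0, 1.0, 1.0]))
```

Bins with a high signal-to-noise ratio pass almost unchanged, while bins dominated by noise are strongly attenuated.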



1.3.2 Blind Channel Equalisation 

Channel equalisation is the recovery of a signal distorted in transmission 
through a communication channel with a non-flat magnitude or a non-linear 
phase response. When the channel response is unknown the process of 
signal recovery is called blind equalisation. Blind equalisation has a wide 
range of applications, for example in digital telecommunications for 
removal of inter-symbol interference due to non-ideal channel and multi- 
path propagation, in speech recognition for removal of the effects of the 
microphones and the communication channels, in correction of distorted 
images, analysis of seismic data, de-reverberation of acoustic gramophone 
recordings etc. 

In practice, blind equalisation is feasible only if some useful statistics of 
the channel input are available. The success of a blind equalisation method 
depends on how much is known about the characteristics of the input signal 
and how useful this knowledge can be in the channel identification and 
equalisation process. Figure 1.5 illustrates the configuration of a
decision-directed equaliser. This blind channel equaliser is composed of two
distinct sections: an adaptive equaliser that removes a large part of the
channel distortion, followed by a non-linear decision device for an improved
estimate of the channel input. The output of the decision device is the final
estimate of the channel input, and it is used as the desired signal to direct
the equaliser adaptation process. Blind equalisation is covered in detail in
Chapter 15.

Figure 1.5 Configuration of a decision-directed blind channel equaliser.
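
The decision-directed loop described above can be sketched in a few lines. This is an illustration under stated assumptions, not the method of Chapter 15: the symbols are taken to be binary (+1 or -1), the adaptive equaliser is an FIR filter updated with an LMS rule, and the two-tap noise-free channel in `simulate_channel` is hypothetical. The names `decision_directed_equaliser` and `simulate_channel` are likewise invented for this example.

```python
import random


def decision_directed_equaliser(received, num_taps=4, step=0.02):
    """Adapt an FIR equaliser using its own hard decisions as the desired signal.

    Returns (decisions, final weights). Symbols are assumed to be +1 or -1.
    """
    weights = [1.0] + [0.0] * (num_taps - 1)   # start as a pass-through filter
    taps = [0.0] * num_taps                     # received samples, newest first
    decisions = []
    for y in received:
        taps = [y] + taps[:-1]
        z = sum(w * t for w, t in zip(weights, taps))   # equaliser output
        decision = 1.0 if z >= 0.0 else -1.0             # non-linear decision device
        error = decision - z                             # decision acts as desired signal
        weights = [w + step * error * t for w, t in zip(weights, taps)]
        decisions.append(decision)
    return decisions, weights


def simulate_channel(symbols, echo=0.3):
    """Hypothetical two-tap channel: y(m) = x(m) + echo * x(m-1), without noise."""
    previous = 0.0
    out = []
    for x in symbols:
        out.append(x + echo * previous)
        previous = x
    return out


if __name__ == "__main__":
    rng = random.Random(1)
    symbols = [rng.choice([-1.0, 1.0]) for _ in range(3000)]
    decisions, weights = decision_directed_equaliser(simulate_channel(symbols))
    print("equaliser weights:", [round(w, 3) for w in weights])
```

The sketch works because the assumed channel distortion is mild enough for the initial decisions to be reliable; with severe distortion the decisions, and hence the adaptation, can go wrong.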



1.3.3 Signal Classification and Pattern Recognition

Signal classification is used in detection, pattern recognition and decision- 
making systems. For example, a simple binary-state classifier can act as the 
detector of the presence, or the absence, of a known waveform in noise. In 
signal classification, the aim is to design a minimum-error system for 
labelling a signal with one of a number of likely classes of signal. 

To design a classifier, a set of models is trained for the classes of
signals that are of interest in the application. The simplest form that the
models can assume is a bank, or code book, of waveforms, each
representing the prototype for one class of signals. A more complete model
for each class of signals takes the form of a probability distribution function.
In the classification phase, a signal is labelled with the nearest or the most
likely class. For example, in communication of a binary bit stream over a
band-pass channel, the binary phase-shift keying (BPSK) scheme signals
the bit “1” using the waveform A_c sin(ω_c t) and the bit “0” using -A_c sin(ω_c t).

At the receiver, the decoder has the task of classifying and labelling the 
received noisy signal as a “1” or a “0”. Figure 1.6 illustrates a correlation 
receiver for a BPSK signalling scheme. The receiver has two correlators,
each programmed with one of the two symbols representing the binary
states for the bit “1” and the bit “0”. The decoder correlates the unlabelled
input signal with each of the two candidate symbols and selects the
candidate that has a higher correlation with the input.

Figure 1.6 A block diagram illustration of the classifier in a binary
phase-shift keying demodulation.

Figure 1.7 illustrates the use of a classifier in a limited-vocabulary,
isolated-word speech recognition system. Assume there are V words in the
vocabulary. For each word a model is trained, on many different examples
of the spoken word, to capture the average characteristics and the statistical
variations of the word. The classifier has access to a bank of V+1 models,
one for each word in the vocabulary and an additional model for the silence
periods. In the speech recognition phase, the task is to decode and label an
acoustic speech feature sequence, representing an unlabelled spoken word,
as one of the V likely words or silence. For each candidate word the
classifier calculates a probability score and selects the word with the highest
score.

Figure 1.7 Configuration of speech recognition system; f(Y|M_i) is the
likelihood of the model M_i given an observation sequence Y.
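
The BPSK correlation receiver described earlier in this section can be sketched as follows. The sampled waveforms, the carrier frequency, the block length and the function names `bpsk_symbol` and `correlation_receiver` are assumptions chosen for illustration only.

```python
import math
import random


def bpsk_symbol(bit, amplitude, omega_c, num_samples):
    """Sampled BPSK waveform: +A sin(omega_c m) for bit 1, -A sin(omega_c m) for bit 0."""
    sign = 1.0 if bit == 1 else -1.0
    return [sign * amplitude * math.sin(omega_c * m) for m in range(num_samples)]


def correlation_receiver(received, amplitude, omega_c):
    """Label the received block with the bit whose symbol correlates best with it."""
    scores = {}
    for bit in (1, 0):
        candidate = bpsk_symbol(bit, amplitude, omega_c, len(received))
        scores[bit] = sum(r * c for r, c in zip(received, candidate))
    return max(scores, key=scores.get)


if __name__ == "__main__":
    rng = random.Random(0)
    omega_c = 2.0 * math.pi * 0.1      # hypothetical carrier: 10 samples per cycle
    sent = [1, 0, 0, 1, 1]
    decoded = []
    for bit in sent:
        clean = bpsk_symbol(bit, 1.0, omega_c, 50)
        noisy = [s + rng.gauss(0.0, 0.5) for s in clean]
        decoded.append(correlation_receiver(noisy, 1.0, omega_c))
    print("sent:   ", sent)
    print("decoded:", decoded)
```

The same pattern, a score computed against each stored model followed by selection of the largest, carries over to the word classifier, with the correlation replaced by a probability score.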



1.3.4 Linear Prediction Modelling of Speech 

Linear predictive models are widely used in speech processing applications 
such as low-bit-rate speech coding in cellular telephony, speech 
enhancement and speech recognition. Speech is generated by inhaling air 
into the lungs, and then exhaling it through the vibrating glottal cords and
the vocal tract. The random, noise-like, air flow from the lungs is spectrally 
shaped and amplified by the vibrations of the glottal cords and the resonance 
of the vocal tract. The effect of the vibrations of the glottal cords and the 
vocal tract is to introduce a measure of correlation and predictability on the 
random variations of the air from the lungs. Figure 1.8 illustrates a model 
for speech production. The source models the lungs and emits a random
excitation signal which is filtered, first by a pitch filter model of the glottal 
cords and then by a model of the vocal tract. 

Figure 1.8 Linear predictive model of speech.

The main source of correlation in speech is the vocal tract modelled by a
linear predictor. A linear predictor forecasts the amplitude of the signal at
time m, x(m), using a linear combination of P previous samples
[x(m-1), ..., x(m-P)] as

\hat{x}(m) = \sum_{k=1}^{P} a_k x(m-k)    (1.3)

where \hat{x}(m) is the prediction of the signal x(m), and the vector
a^T = [a_1, ..., a_P] is the coefficients vector of a predictor of order P. The
prediction error e(m), i.e. the difference between the actual sample x(m)
and its predicted value \hat{x}(m), is defined as

e(m) = x(m) - \sum_{k=1}^{P} a_k x(m-k)    (1.4)

The prediction error e(m) may also be interpreted as the random excitation
or the so-called innovation content of x(m). From Equation (1.4) a signal
generated by a linear predictor can be synthesised as

x(m) = \sum_{k=1}^{P} a_k x(m-k) + e(m)    (1.5)

Equation (1.5) describes a speech synthesis model illustrated in Figure 1.9.

Figure 1.9 Illustration of a signal generated by an all-pole, linear prediction
model.
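
The relation between prediction, prediction error and synthesis can be checked numerically with a small sketch. The function names and the example coefficients are hypothetical, and samples before the start of the signal are assumed to be zero; estimating the coefficients a_k from data is a separate problem that is not attempted here.

```python
def predict(coeffs, history):
    """Prediction: a linear combination of the P most recent samples.

    history[-1] is x(m-1), history[-2] is x(m-2), and so on. Samples before
    the start of the signal are taken to be zero (an assumption of this sketch).
    """
    total = 0.0
    for k, a_k in enumerate(coeffs, start=1):
        if k <= len(history):
            total += a_k * history[-k]
    return total


def prediction_error(signal, coeffs):
    """Prediction error: e(m) = x(m) - sum_k a_k x(m-k)."""
    return [x - predict(coeffs, signal[:m]) for m, x in enumerate(signal)]


def synthesise(excitation, coeffs):
    """Synthesis: x(m) = sum_k a_k x(m-k) + e(m)."""
    signal = []
    for e in excitation:
        signal.append(predict(coeffs, signal) + e)
    return signal


if __name__ == "__main__":
    coeffs = [0.5, -0.25]                      # hypothetical predictor of order P = 2
    excitation = [1.0, 0.0, 0.0, 0.0]          # a single impulse as the excitation
    signal = synthesise(excitation, coeffs)
    print("synthesised signal:", signal)
    print("recovered excitation:", prediction_error(signal, coeffs))
```

Passing the synthesised signal back through the prediction-error computation returns the original excitation, which is the sense in which e(m) is the innovation content of x(m).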



1.3.5 Digital Coding of Audio Signals 

In digital audio, the memory required to record a signal and the bandwidth
required for signal transmission are directly proportional to the number of
bits per sample, and the signal-to-quantisation-noise ratio in decibels
increases approximately linearly with it. The objective
in the design of a coder is to achieve high fidelity with as few bits per
sample as possible, at an affordable implementation cost. Audio signal
coding schemes utilise the statistical structures of the signal, and a model of
the signal generation, together with information on the psychoacoustics and
the masking effects of hearing. In general, there are two main categories of
audio coders: model-based coders, used for low-bit-rate speech coding in
applications such as cellular telephony, and transform-based coders, used in
high-quality coding of speech and digital hi-fi audio.

Figure 1.10 Block diagram configuration of a model-based speech coder.
[Labels shown: (a) source coder; synthesiser; pitch coefficients;
vocal-tract coefficients.]
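
The dependence of storage and quantisation noise on the number of bits per sample, stated at the start of this section, can be made concrete with a little arithmetic. The sketch assumes uncompressed PCM storage and uses the widely quoted approximation SQNR ≈ 6.02b + 1.76 dB for a uniform quantiser driven by a full-scale sinusoid; neither the formula nor the function names `storage_bytes` and `sqnr_db` come from this section.

```python
def storage_bytes(sample_rate_hz, bits_per_sample, duration_s, channels=1):
    """Memory needed for uncompressed PCM audio, in bytes."""
    return sample_rate_hz * bits_per_sample * duration_s * channels / 8


def sqnr_db(bits_per_sample):
    """Approximate SQNR in dB of a uniform quantiser for a full-scale sinusoid."""
    return 6.02 * bits_per_sample + 1.76


if __name__ == "__main__":
    # One second of single-channel audio at a 44.1 kHz sampling rate.
    for bits in (8, 12, 16):
        print(bits, "bits:", storage_bytes(44100, bits, 1), "bytes,",
              round(sqnr_db(bits), 2), "dB")
```

Each additional bit per sample costs a fixed amount of extra memory and bandwidth and buys roughly 6 dB of signal-to-quantisation-noise ratio, which is the trade-off a coder design has to balance.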

Figure 1.10 shows a simplified block diagram configuration of a speech
coder-synthesiser of the type used in digital cellular telephones. The speech
signal is modelled as the output of a filter excited by a random signal. The
random excitation models the air exhaled from the lungs, and the filter
models the vibrations of the glottal cords and the vocal tract. At the
transmitter, speech is segmented into blocks about 30 ms long, during
which the speech parameters can be assumed to be stationary. Each block of
speech samples is analysed to extract and transmit a set of excitation and
filter parameters that can be used to synthesise the speech. At the receiver, the
model parameters and the excitation are used to reconstruct the speech.

A transform-based coder is shown in Figure 1.11. The aim of 
transformation is to convert the signal into a form where it lends itself to a 
more convenient and useful interpretation and manipulation. In Figure 1.11 
the input signal is transformed to the frequency domain using a filter bank, 
or a discrete Fourier transform, or a discrete cosine transform. Three main 
advantages of coding a signal in the frequency domain are: 

(a) The frequency spectrum of a signal has a relatively well-defined
structure, for example most of the signal power is usually
concentrated in the lower regions of the spectrum.

(b) A relatively low-amplitude frequency would be masked in the near
vicinity of a large-amplitude frequency and can therefore be
coarsely encoded without any audible degradation.

(c) The frequency samples are orthogonal and can be coded
independently with different precisions.

Figure 1.11 Illustration of a transform-based coder.
[Labels shown: input samples x(0), x(1), x(2), ..., x(N-1); binary coded
signal; reconstructed signal.]

The number of bits assigned to each frequency of a signal is a variable 
that reflects the contribution of that frequency to the reproduction of a 
perceptually high quality signal. In an adaptive coder, the allocation of bits 
to different frequencies is made to vary with the time variations of the 
power spectrum of the signal. 



1.3.6 Detection of Signals in Noise 

In the detection of signals in noise, the aim is to determine if the observation
consists of noise alone, or if it contains a signal. The noisy observation
y(m) can be modelled as

$$ y(m) = b(m)\,x(m) + n(m) \tag{1.6} $$

where x(m) is the signal to be detected, n(m) is the noise and b(m) is a
binary-valued state indicator sequence such that b(m) = 1 indicates the
presence of the signal x(m) and b(m) = 0 indicates that the signal is absent.
If the signal x(m) has a known shape, then a correlator or a matched filter
can be used to detect the signal as shown in Figure 1.12.

Figure 1.12 Configuration of a matched filter followed by a threshold comparator for
detection of signals in noise.

The impulse response h(m) of the matched filter for detection of a signal
x(m) is the time-reversed version of x(m) given by

$$ h(m) = x(N-1-m), \qquad 0 \le m \le N-1 \tag{1.7} $$

where N is the length of x(m). The output of the matched filter is given by



$$ z(m) = \sum_{k=0}^{N-1} h(k)\,y(m-k) \tag{1.8} $$



The matched filter output is compared with a threshold and a binary 
decision is made as 

$$ \hat{b}(m) = \begin{cases} 1 & \text{if } z(m) \ge \text{threshold} \\ 0 & \text{otherwise} \end{cases} \tag{1.9} $$

where $\hat{b}(m)$ is an estimate of the binary state indicator sequence b(m), and
it may be erroneous, in particular if the signal-to-noise ratio is low. Table 1.1
lists four possible outcomes that b(m) and its estimate $\hat{b}(m)$ can together
assume.

$\hat{b}(m)$   b(m)   Detector decision
0              0      Signal absent (correct)
0              1      Signal absent (missed detection)
1              0      Signal present (false alarm)
1              1      Signal present (correct)

Table 1.1 Four possible outcomes in a signal detection problem.

Figure 1.13 Sonar: detection of objects using the intensity and time delay of
reflected sound waves.

The choice of the threshold level affects the sensitivity of the detector.
The higher the threshold, the less the likelihood that noise would
be classified as signal, so the false alarm rate falls, but the probability of
misclassification of signal as noise increases. The risk in choosing a
threshold value $\theta$ can be expressed as

$$ R(\text{Threshold} = \theta) = P_{\text{FalseAlarm}}(\theta) + P_{\text{Miss}}(\theta) \tag{1.10} $$

The choice of the threshold reflects a trade-off between the misclassification
rate $P_{\text{Miss}}(\theta)$ and the false alarm rate $P_{\text{FalseAlarm}}(\theta)$.
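
The detection chain described in this section (time-reversed template, filtering, threshold comparison) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than a reference detector: the function names, the example pulse shape, the noise level and the threshold are all hypothetical, and the filter output is computed as the convolution of h with y, taking y(m) = 0 for m < 0.

```python
import random


def matched_filter(x):
    """Impulse response h(m) = x(N-1-m): the time-reversed template."""
    return x[::-1]


def filter_output(h, y):
    """z(m) = sum over k of h(k) * y(m-k), with y(m) = 0 for m < 0."""
    z = []
    for m in range(len(y)):
        acc = 0.0
        for k in range(len(h)):
            if m - k >= 0:
                acc += h[k] * y[m - k]
        z.append(acc)
    return z


def detect(z, threshold):
    """Binary decision: 1 where z(m) >= threshold, 0 otherwise."""
    return [1 if value >= threshold else 0 for value in z]


# Hypothetical example: a known pulse buried in Gaussian noise.
rng = random.Random(0)
x = [1.0, -1.0, 1.0, 1.0, -1.0, -1.0, 1.0, 1.0]
start = 20
y = [rng.gauss(0.0, 0.3) for _ in range(60)]
for i, sample in enumerate(x):
    y[start + i] += sample

z = filter_output(matched_filter(x), y)
peak = max(range(len(z)), key=lambda m: z[m])
print("largest filter output at m =", peak)
print("decisions:", detect(z, threshold=5.0))
```

In the absence of noise the output peaks at the last sample of the embedded pulse, m = start + N - 1, where it equals the energy of x(m). Raising the threshold makes false alarms less likely and misses more likely, which is the trade-off expressed by the risk above.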



1.3.7 Directional Reception of Waves: Beam-forming 

Beam-forming is the spatial processing of plane waves received by an array 
of sensors such that the waves incident at a particular spatial angle are 
passed through, whereas those arriving from other directions are attenuated. 
Beam-forming is used in radar and sonar signal processing (Figure 1.13) to 
steer the reception of signals towards a desired direction, and in speech 
processing for reducing the effects of ambient noise. 

To explain the process of beam-forming consider a uniform linear array
of sensors as illustrated in Figure 1.14. The term linear array implies that
the array of sensors is spatially arranged in a straight line and with equal
spacing d between the sensors. Consider a sinusoidal far-field plane wave
with a frequency $F_0$ propagating towards the sensors at an incidence angle
of $\theta$ as illustrated in Figure 1.14. The array of sensors samples the
incoming wave as it propagates in space. The time delay between the arrival
of the wave at two adjacent sensors, a distance d apart, is given by



$$ \tau = \frac{d \sin\theta}{c} \tag{1.11} $$



where c is the speed of propagation of the wave in the medium. The phase
difference corresponding to a delay of $\tau$ is given by

$$ \varphi = 2\pi \frac{\tau}{T_0} = 2\pi F_0 \frac{d \sin\theta}{c} \tag{1.12} $$



where $T_0$ is the period of the sine wave. By inserting appropriate corrective
time delays in the path of the samples at each sensor, and then averaging the
outputs of the sensors, the signals arriving from the direction $\theta$ will be
time-aligned and coherently combined, whereas those arriving from other
directions will suffer cancellations and attenuations. Figure 1.14 illustrates a
beam-former as an array of digital filters arranged in space. The filter array
acts as a two-dimensional space-time signal processing system. The space
filtering allows the beam-former to be steered towards a desired direction,
for example towards the direction along which the incoming signal has the
maximum intensity. The phase of each filter controls the time delay, and can
be adjusted to coherently combine the signals. The magnitude frequency
response of each filter can be used to remove the out-of-band noise.

Figure 1.14 Illustration of a beam-former, for directional reception of signals.
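
The effect of the corrective delays can be checked numerically with a small delay-and-sum sketch. The code below is an illustration under assumptions, not a full beam-former: the function names, the eight-sensor array, the 0.17 m spacing, the 1 kHz tone and the 340 m/s propagation speed are all hypothetical values, and each sensor signal is represented by a unit-amplitude complex phasor.

```python
import cmath
import math


def inter_sensor_delay(d, theta, c):
    """Delay between adjacent sensors: tau = d * sin(theta) / c."""
    return d * math.sin(theta) / c


def phase_shift(f0, d, theta, c):
    """Phase difference for that delay: phi = 2 * pi * f0 * tau."""
    return 2.0 * math.pi * f0 * inter_sensor_delay(d, theta, c)


def array_gain(num_sensors, f0, d, steer_theta, arrival_theta, c):
    """Magnitude of the averaged output for a unit-amplitude plane wave.

    The wave arrives from arrival_theta; the corrective delays are set
    for steer_theta. A gain of 1 means fully coherent combination.
    """
    residual = (phase_shift(f0, d, arrival_theta, c)
                - phase_shift(f0, d, steer_theta, c))
    total = sum(cmath.exp(1j * k * residual) for k in range(num_sensors))
    return abs(total) / num_sensors


# Hypothetical acoustic array steered towards 0.5 rad.
for arrival in (0.5, 0.25, 0.0, -0.5):
    gain = array_gain(8, 1000.0, 0.17, 0.5, arrival, 340.0)
    print(f"arrival angle {arrival:+.2f} rad -> gain {gain:.3f}")
```

A wave arriving from the steered direction is combined with unit gain, whereas waves from other directions add with mismatched phases and are attenuated.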



1.3.8 Dolby Noise Reduction 

Dolby noise reduction systems work by boosting the energy and the signal 
to noise ratio of the high-frequency spectrum of audio signals. The energy 
of audio signals is mostly concentrated in the low-frequency part of the 
spectrum (below 2 kHz). The higher frequencies that convey quality and 
sensation have relatively low energy, and can be degraded even by a low 
amount of noise. For example when a signal is recorded on a magnetic tape, 
the tape “hiss” noise affects the quality of the recorded signal. On playback,
the higher-frequency parts of an audio signal recorded on a tape have a smaller
signal-to-noise ratio than the low-frequency parts. Therefore noise at high
frequencies is more audible and less masked by the signal energy. Dolby
noise reduction systems broadly work on the principle of emphasising and 
boosting the low energy of the high-frequency signal components prior to 
recording the signal. When a signal is recorded it is processed and encoded 
using a combination of a pre-emphasis filter and dynamic range 
compression. At playback, the signal is recovered using a decoder based on 
a combination of a de-emphasis filter and a decompression circuit. The 
encoder and decoder must be well matched and cancel out each other in 
order to avoid processing distortion. 

Dolby has developed a number of noise reduction systems designated
Dolby A, Dolby B and Dolby C. These differ mainly in the number of bands
and the pre-emphasis strategy that they employ. Dolby A, developed for
professional use, divides the signal spectrum into four frequency bands:
band 1 is low-pass and covers 0 Hz to 80 Hz; band 2 is band-pass and covers
80 Hz to 3 kHz; band 3 is high-pass and covers above 3 kHz; and band 4 is
also high-pass and covers above 9 kHz. At the encoder the gain of each band
is adaptively adjusted to boost low-energy signal components. Dolby A
provides a maximum gain of 10 to 15 dB in each band if the signal level
falls 45 dB below the maximum recording level. The Dolby B and Dolby C
systems are designed for consumer audio systems, and use two bands
instead of the four bands used in Dolby A. Dolby B provides a boost of up
to 10 dB when the signal level is low (45 dB below the maximum
reference level) and Dolby C provides a boost of up to 20 dB, as illustrated in
Figure 1.15.

Figure 1.15 Illustration of the pre-emphasis response of Dolby C: up to 20 dB
boost is provided when the signal falls 45 dB below maximum recording level.



1.3.9 Radar Signal Processing: Doppler Frequency Shift 

Figure 1.16 shows a simple diagram of a radar system that can be used to 
estimate the range and speed of an object such as a moving car or a flying 
aeroplane. A radar system consists of a transceiver (transmitter/receiver) that 
generates and transmits sinusoidal pulses at microwave frequencies. The 
signal travels with the speed of light and is reflected back from any object in 
its path. The analysis of the received echo provides such information as 
range, speed, and acceleration. The received signal has the form

$$ x(t) = A(t)\cos\{\omega_0 [t - 2r(t)/c]\} \tag{1.13} $$



where A(t), the time-varying amplitude of the reflected wave, depends on the
position and the characteristics of the target, r(t) is the time-varying distance
of the object from the radar and c is the velocity of light. The time-varying
distance of the object can be expanded in a Taylor series as

$$ r(t) = r_0 + \dot{r}\,t + \frac{1}{2!}\ddot{r}\,t^2 + \frac{1}{3!}\dddot{r}\,t^3 + \cdots \tag{1.14} $$

where $r_0$ is the distance, $\dot{r}$ is the velocity, $\ddot{r}$ is the acceleration etc.

Approximating r(t) with the first two terms of the Taylor series expansion
we have

$$ r(t) \approx r_0 + \dot{r}\,t \tag{1.15} $$

Substituting Equation (1.15) in Equation (1.13) yields

$$ x(t) = A(t)\cos[(\omega_0 - 2\dot{r}\omega_0/c)\,t - 2\omega_0 r_0/c] \tag{1.16} $$

Note that the frequency of the reflected wave is shifted by an amount

$$ \omega_d = 2\dot{r}\omega_0/c \tag{1.17} $$

This shift in frequency is known as the Doppler frequency. If the object is
moving towards the radar then the distance r(t) is decreasing with time, $\dot{r}$ is
negative, and an increase in the frequency is observed. Conversely if the
object is moving away from the radar then the distance r(t) is increasing, $\dot{r}$ is
positive, and a decrease in the frequency is observed. Thus the frequency
analysis of the reflected signal can reveal information on the direction and
speed of the object. The distance $r_0$ is given by

$$ r_0 = 0.5\,T \times c \tag{1.18} $$

where T is the round-trip time for the signal to hit the object and arrive back
at the radar and c is the velocity of light.

Figure 1.16 Illustration of a radar system.
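
The Doppler and range relations above reduce to simple arithmetic, shown here as a short Python sketch. The function names and the example numbers (a 10 GHz carrier, a target closing at 30 m/s, a 100 microsecond round trip) are assumptions chosen for illustration; the shift is expressed in hertz rather than in radians per second.

```python
SPEED_OF_LIGHT = 299_792_458.0  # metres per second


def doppler_shift_hz(radial_velocity, f0, c=SPEED_OF_LIGHT):
    """Doppler term 2 * r_dot * f0 / c, in Hz.

    radial_velocity is r_dot: negative when the target approaches.
    """
    return 2.0 * radial_velocity * f0 / c


def received_frequency_hz(radial_velocity, f0, c=SPEED_OF_LIGHT):
    """Frequency of the echo, f0 - 2 * r_dot * f0 / c."""
    return f0 - doppler_shift_hz(radial_velocity, f0, c)


def target_range_m(round_trip_time, c=SPEED_OF_LIGHT):
    """Range r0 = 0.5 * T * c from the round-trip time T."""
    return 0.5 * round_trip_time * c


f0 = 10e9          # hypothetical 10 GHz radar carrier
r_dot = -30.0      # target approaching at 30 m/s
print("Doppler term (Hz):", doppler_shift_hz(r_dot, f0))
print("echo is higher than carrier:", received_frequency_hz(r_dot, f0) > f0)
print("range for a 100 us round trip (m):", target_range_m(100e-6))
```

A negative radial velocity gives a negative Doppler term and hence an echo above the carrier frequency, in agreement with the sign discussion in the text.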



1.4 Sampling and Analog-to-Digital Conversion 

A digital signal is a sequence of real-valued or complex-valued numbers, 
representing the fluctuations of an information bearing quantity with time, 
space or some other variable. The basic elementary discrete-time signal is 
the unit-sample signal $\delta(m)$ defined as

$$ \delta(m) = \begin{cases} 1 & m = 0 \\ 0 & m \ne 0 \end{cases} \tag{1.19} $$

where m is the discrete time index. A digital signal x(m) can be expressed as
the sum of a number of amplitude-scaled and time-shifted unit samples as

$$ x(m) = \sum_{k=-\infty}^{\infty} x(k)\,\delta(m-k) \tag{1.20} $$

Figure 1.17 illustrates a discrete-time signal. Many random processes, such
as speech, music, radar and sonar generate signals that are continuous in
time and continuous in amplitude. Continuous signals are termed analog
because their fluctuations with time are analogous to the variations of the
signal source. For digital processing, analog signals are sampled, and each
sample is converted into an n-bit digit. The digitisation process should be
performed such that the original signal can be recovered from its digital
version with no loss of information, and with as high a fidelity as is required
in an application. Figure 1.18 illustrates a block diagram configuration of a
digital signal processor with an analog input. The low-pass filter removes
out-of-band signal frequencies above a pre-selected range. The
sample-and-hold (S/H) unit periodically samples the signal to convert the
continuous-time signal into a discrete-time signal.

Figure 1.17 A discrete-time signal and its envelope of variation with time.

Figure 1.18 Configuration of a digital signal processing system.

The analog-to-digital converter (ADC) maps each continuous
amplitude sample into an n-bit digit. After processing, the digital output of
the processor can be converted back into an analog signal using a
digital-to-analog converter (DAC) and a low-pass filter as illustrated in
Figure 1.18.



1.4.1 Time-Domain Sampling and Reconstruction of Analog 
Signals 

The conversion of an analog signal to a sequence of n-bit digits consists of 
two basic steps of sampling and quantisation. The sampling process, when 
performed with sufficiently high speed, can capture the fastest fluctuations 
of the signal, and can be a loss-less operation in that the analog signal can be 
recovered through interpolation of the sampled sequence as described in 
Chapter 10. The quantisation of each sample into an n-bit digit, involves 
some irrevocable error and possible loss of information. However, in 
practice the quantisation error can be made negligible by using an 
appropriately high number of bits as in a digital audio hi-fi. A sampled 
signal can be modelled as the product of a continuous-time signal x(t) and a
periodic impulse train p(t) as

$$ x_{\text{sampled}}(t) = x(t)\,p(t) = \sum_{m=-\infty}^{\infty} x(t)\,\delta(t - mT_s) \tag{1.21} $$

where $T_s$ is the sampling interval and the sampling function p(t) is defined
as

$$ p(t) = \sum_{m=-\infty}^{\infty} \delta(t - mT_s) \tag{1.22} $$

The spectrum P(f) of the sampling function p(t) is also a periodic impulse
train given by

$$ P(f) = \sum_{k=-\infty}^{\infty} \delta(f - kF_s) \tag{1.23} $$

where $F_s = 1/T_s$ is the sampling frequency. Since multiplication of two
time-domain signals is equivalent to the convolution of their frequency spectra
we have

$$ X_{\text{sampled}}(f) = FT[x(t)\,p(t)] = X(f) * P(f) = \sum_{k=-\infty}^{\infty} X(f - kF_s) \tag{1.24} $$

where the operator FT[·] denotes the Fourier transform. In Equation (1.24)
the convolution of a signal spectrum X(f) with each impulse $\delta(f - kF_s)$
shifts X(f) and centres it on $kF_s$. Hence, as expressed in Equation (1.24),
the sampling of a signal x(t) results in a periodic repetition of its spectrum
X(f) centred on frequencies $0, \pm F_s, \pm 2F_s, \ldots$. When the sampling
frequency is higher than twice the maximum frequency content of the
signal, then the repetitions of the signal spectra are separated as shown in
Figure 1.19. In this case, the analog signal can be recovered by passing the
sampled signal through an analog low-pass filter with a cut-off frequency of
$F_s/2$. If the sampling frequency is less than twice the maximum frequency
content of the signal, then the adjacent repetitions
of the spectrum overlap and the original spectrum cannot be recovered. The
distortion, due to an insufficiently high sampling rate, is irrevocable and is
known as aliasing. This observation is the basis of the Nyquist sampling
theorem which states: a band-limited continuous-time signal, with a highest
frequency content (bandwidth) of B Hz, can be recovered from its samples
provided that the sampling speed $F_s > 2B$ samples per second.
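
Aliasing can be demonstrated numerically: a sinusoid above half the sampling frequency produces exactly the same samples as a lower-frequency one. The following Python sketch is a minimal illustration; the function names and the example values (an 8 kHz sampling rate with 7 kHz and 1 kHz tones) are assumptions chosen for the demonstration.

```python
import math


def apparent_frequency(f, fs):
    """Frequency in [0, fs/2] at which a sinusoid of frequency f
    appears after sampling at fs samples per second."""
    folded = f % fs
    return min(folded, fs - folded)


def sample_cosine(f, fs, num_samples):
    """Samples of cos(2 * pi * f * t) taken at t = m / fs."""
    return [math.cos(2.0 * math.pi * f * m / fs) for m in range(num_samples)]


fs = 8000.0
high = sample_cosine(7000.0, fs, 16)   # violates fs > 2B
low = sample_cosine(1000.0, fs, 16)    # satisfies fs > 2B
max_diff = max(abs(u - v) for u, v in zip(high, low))

print("7 kHz tone appears at (Hz):", apparent_frequency(7000.0, fs))
print("largest sample difference:", max_diff)
```

Because the two sample sequences agree to within rounding error, no processing applied after sampling can tell the 7 kHz tone from the 1 kHz tone, which is why the out-of-band components must be removed before the sampler.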

In practice sampling is achieved using an electronic switch that allows a
capacitor to charge up or down to the level of the input voltage once every
$T_s$ seconds as illustrated in Figure 1.20. The sample-and-hold signal can be
modelled as the output of a filter with a rectangular impulse response, and
with the impulse-train-sampled signal as the input as illustrated in
Figure 1.19.

Figure 1.20 A simplified sample-and-hold circuit diagram.



1.4.2 Quantisation 

For digital signal processing, continuous-amplitude samples from the 
sample-and-hold are quantised and mapped into n-bit binary digits. For 
quantisation to n bits, the amplitude range of the signal is divided into $2^n$
discrete levels, and each sample is quantised to the nearest quantisation
level, and then mapped to the binary code assigned to that level. Figure 1.21
illustrates the quantisation of a signal into 4 discrete levels. Quantisation is a
many-to-one mapping, in that all the values that fall within the continuum of
a quantisation band are mapped to the centre of the band. The mapping
between an analog sample $x_a(m)$ and its quantised value x(m) can be
expressed as

$$ x(m) = Q[x_a(m)] \tag{1.25} $$

where Q[·] is the quantising function.

The performance of a quantiser is measured by signal-to-quantisation
noise ratio SQNR per bit. The quantisation noise is defined as

$$ e(m) = x(m) - x_a(m) \tag{1.26} $$

Now consider an n-bit quantiser with an amplitude range of ±V volts. The
quantisation step size is $\Delta = 2V/2^n$. Assuming that the quantisation noise is a
zero-mean uniform process with an amplitude range of $\pm\Delta/2$ we can express
the noise power as

$$ E[e^2(m)] = \int_{-\Delta/2}^{\Delta/2} f_E(e(m))\,e^2(m)\,de(m) = \frac{1}{\Delta}\int_{-\Delta/2}^{\Delta/2} e^2(m)\,de(m) = \frac{\Delta^2}{12} = \frac{V^2\,2^{-2n}}{3} \tag{1.27} $$

where $f_E(e(m)) = 1/\Delta$ is the uniform probability density function of the
noise. Using Equation (1.27) the signal-to-quantisation noise ratio is given
by

$$ \begin{aligned} SQNR(n) &= 10\log_{10}\left(\frac{E[x^2(m)]}{E[e^2(m)]}\right) = 10\log_{10}\left(\frac{P_{\text{Signal}}}{V^2\,2^{-2n}/3}\right) \\ &= 10\log_{10} 3 - 10\log_{10}\left(\frac{V^2}{P_{\text{Signal}}}\right) + 10\log_{10} 2^{2n} \\ &= 4.77 - \alpha + 6n \end{aligned} \tag{1.28} $$

where $P_{\text{Signal}}$ is the mean signal power, and $\alpha$ is the ratio in decibels of the
peak signal power $V^2$ to the mean signal power $P_{\text{Signal}}$. Therefore, from
Equation (1.28), every additional bit in an analog-to-digital converter results
in a 6 dB improvement in signal-to-quantisation noise ratio.
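
The 6 dB per bit rule can be checked with a small simulation. The sketch below is an illustration under assumptions, not part of the original derivation: it quantises a full-scale sine wave (for which the peak-to-mean power ratio is 10 log10(2), about 3 dB) with a uniform quantiser that maps each sample to the centre of its band, and compares the measured SQNR with Equation (1.28). The function names, the test signal and the number of samples are all hypothetical choices.

```python
import math


def quantise(x, n_bits, v=1.0):
    """Uniform n-bit quantiser over [-v, +v]: each sample is mapped
    to the centre of the quantisation band that contains it."""
    levels = 2 ** n_bits
    step = 2.0 * v / levels
    index = int(math.floor((x + v) / step))
    index = min(levels - 1, max(0, index))  # clamp x = +v into the top band
    return -v + (index + 0.5) * step


def sqnr_db(signal, n_bits, v=1.0):
    """Measured signal-to-quantisation-noise ratio in decibels."""
    p_signal = sum(s * s for s in signal) / len(signal)
    p_noise = sum((quantise(s, n_bits, v) - s) ** 2
                  for s in signal) / len(signal)
    return 10.0 * math.log10(p_signal / p_noise)


def predicted_sqnr_db(n_bits, alpha_db):
    """Equation (1.28): 10*log10(3) - alpha + 10*log10(2**(2n))."""
    return 10.0 * math.log10(3.0) - alpha_db + 20.0 * n_bits * math.log10(2.0)


# Full-scale sine wave: peak power V^2 = 1, mean power 1/2.
signal = [math.sin(0.1234567 * m) for m in range(20000)]
alpha_sine_db = 10.0 * math.log10(2.0)

for n in (8, 10, 12):
    print(n, round(sqnr_db(signal, n), 2),
          round(predicted_sqnr_db(n, alpha_sine_db), 2))
```

The measured values should track the prediction closely and rise by about 6 dB for each added bit, provided the quantisation error is well approximated by a uniform distribution, which holds when the step size is small compared with the signal amplitude.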



Figure 1.21 Offset-binary scalar quantisation.




Bibliography

Alexander S.T. (1986) Adaptive Signal Processing Theory and
Applications. Springer-Verlag, New York.

Davenport W.B. and Root W.L. (1958) An Introduction to the Theory of 
Random Signals and Noise. McGraw-Hill, New York. 

Ephraim Y. (1992) Statistical Model Based Speech Enhancement Systems.
Proc. IEEE, 80, 10, pp. 1526-1555.

Gauss K.G. (1963) Theory of Motion of Heavenly Bodies. Dover, New 
York. 

Gallager R.G. (1968) Information Theory and Reliable Communication. 
Wiley, New York. 

Haykin S. (1991) Adaptive Filter Theory. Prentice-Hall, Englewood Cliffs, 
NJ. 

Haykin S. (1985) Array Signal Processing. Prentice-Hall, Englewood 
Cliffs, NJ. 

Kailath T. (1980) Linear Systems. Prentice Hall, Englewood Cliffs, NJ. 

Kalman R.E. (1960) A New Approach to Linear Filtering and Prediction
Problems. Trans. of the ASME, Series D, Journal of Basic Engineering,
82, pp. 35-45.

Kay S.M. (1993) Fundamentals of Statistical Signal Processing, Estimation 
Theory. Prentice-Hall, Englewood Cliffs, NJ. 

Lim J.S. (1983) Speech Enhancement. Prentice Hall, Englewood Cliffs, NJ. 

Lucky R.W., Salz J. and Weldon E.J. (1968) Principles of Data 
Communications. McGraw-Hill, New York. 

Kung S.Y. (1993) Digital Neural Networks. Prentice-Hall, Englewood 
Cliffs, NJ. 

Marple S.L. (1987) Digital Spectral Analysis with Applications. Prentice- 
Hall, Englewood Cliffs, NJ. 

Oppenheim A.V. and Schafer R.W. (1989) Discrete-Time Signal 
Processing. Prentice-Hall, Englewood Cliffs, NJ. 

Proakis J.G., Rader C.M., Ling F. and Nikias C.L. (1992) Advanced 
Signal Processing. Macmillan, New York. 

Rabiner L.R. and Gold B. (1975) Theory and Applications of Digital 
Processing. Prentice-Hall, Englewood Cliffs, NJ. 

Rabiner L.R. and Schafer R.W. (1978) Digital Processing of Speech 
Signals. Prentice-Hall, Englewood Cliffs, NJ. 

Scharf L.L. (1991) Statistical Signal Processing: Detection, Estimation, 
and Time Series Analysis. Addison Wesley, Reading, MA. 

Therrien C.W. (1992) Discrete Random Signals and Statistical Signal 
Processing. Prentice-Hall, Englewood Cliffs, NJ. 




Van-Trees H.L. (1971) Detection, Estimation and Modulation Theory.
Parts I, II and III. Wiley, New York.

Shannon C.E. (1948) A Mathematical Theory of Communication. Bell 
Systems Tech. J., 27, pp. 379-423, 623-656. 

Willsky A.S. (1979) Digital Signal Processing, Control and Estimation
Theory: Points of Tangency, Areas of Intersection and Parallel
Directions. MIT Press, Cambridge, MA.

Widrow B. (1975) Adaptive Noise Cancelling: Principles and Applications.
Proc. IEEE, 63, pp. 1692-1716.

Wiener N. (1948) Extrapolation, Interpolation and Smoothing of Stationary 
Time Series. MIT Press, Cambridge, MA. 

Wiener N. (1949) Cybernetics. MIT Press, Cambridge, MA. 

Zadeh L. A. and Desoer C.A. (1963) Linear System Theory: The State- 
Space Approach. McGraw-Hill, New York. 




Advanced Digital Signal Processing and Noise Reduction, Second Edition. 

Saeed V. Vaseghi 
Copyright © 2000 John Wiley & Sons Ltd 
ISBNs: 0-471-62692-9 (Hardback): 0-470-84162-1 (Electronic) 





NOISE AND DISTORTION 



2.1 Introduction 

2.2 White Noise 

2.3 Coloured Noise 

2.4 Impulsive Noise 

2.5 Transient Noise Pulses 



2.6 Thermal Noise 

2.7 Shot Noise 

2.8 Electromagnetic Noise 

2.9 Channel Distortions

Noise can be defined as an unwanted signal that interferes with the
communication or measurement of another signal. A noise itself is a 
signal that conveys information regarding the source of the noise. 
For example, the noise from a car engine conveys information regarding the 
state of the engine. The sources of noise are many, and vary from audio 
frequency acoustic noise emanating from moving, vibrating or colliding 
sources such as revolving machines, moving vehicles, computer fans, 
keyboard clicks, wind, rain, etc. to radio-frequency electromagnetic noise 
that can interfere with the transmission and reception of voice, image and 
data over the radio-frequency spectrum. Signal distortion is the term often 
used to describe a systematic undesirable change in a signal and refers to 
changes in a signal due to the non-ideal characteristics of the transmission 
channel, reverberations, echo and missing samples. 

Noise and distortion are the main limiting factors in communication and 
measurement systems. Therefore the modelling and removal of the effects of 
noise and distortion have been at the core of the theory and practice of 
communications and signal processing. Noise reduction and distortion 
removal are important problems in applications such as cellular mobile 
communication, speech recognition, image processing, medical signal 
processing, radar, sonar, and in any application where the signals cannot be 
isolated from noise and distortion. In this chapter, we study the
characteristics and modelling of several different forms of noise.

2.1 Introduction

Noise may be defined as any unwanted signal that interferes with the 
communication, measurement or processing of an information-bearing 
signal. Noise is present in various degrees in almost all environments. For 
example, in a digital cellular mobile telephone system, there may be several
varieties of noise that could degrade the quality of communication, such as
acoustic background noise, thermal noise, electromagnetic radio-frequency
noise, co-channel interference, radio-channel distortion, echo and processing
noise. Noise can cause transmission errors and may even disrupt a
communication process; hence noise processing is an important part of
modern telecommunication and signal processing systems. The success of a
noise processing method depends on its ability to characterise and model the 
noise process, and to use the noise characteristics advantageously to 
differentiate the signal from the noise. Depending on its source, a noise can 
be classified into a number of categories, indicating the broad physical 
nature of the noise, as follows: 

(a) Acoustic noise: emanates from moving, vibrating, or colliding 
sources and is the most familiar type of noise present in various 
degrees in everyday environments. Acoustic noise is generated by 
such sources as moving cars, air-conditioners, computer fans, traffic, 
people talking in the background, wind, rain, etc. 

(b) Electromagnetic noise: present at all frequencies and in particular at 
the radio frequencies. All electric devices, such as radio and 
television transmitters and receivers, generate electromagnetic noise. 

(c) Electrostatic noise: generated by the presence of a voltage with or 
without current flow. Fluorescent lighting is one of the more 
common sources of electrostatic noise. 

(d) Channel distortions, echo, and fading: due to non-ideal
characteristics of communication channels. Radio channels, such as
those at microwave frequencies used by cellular mobile phone
operators, are particularly sensitive to the propagation characteristics
of the channel environment.

(e) Processing noise: the noise that results from the digital/analog 
processing of signals, e.g. quantisation noise in digital coding of 
speech or image signals, or lost data packets in digital data 
communication systems.

Depending on its frequency or time characteristics, a noise process can
be classified into one of several categories as follows: 

(a) Narrowband noise: a noise process with a narrow bandwidth such as 
a 50/60 Hz ‘hum’ from the electricity supply. 

(b) White noise: purely random noise that has a flat power spectrum. 
White noise theoretically contains all frequencies in equal intensity. 

(c) Band-limited white noise: a noise with a flat spectrum and a limited 
bandwidth that usually covers the limited spectrum of the device or 
the signal of interest. 

(d) Coloured noise: non-white noise or any wideband noise whose 
spectrum has a non-flat shape; examples are pink noise, brown noise 
and autoregressive noise. 

(e) Impulsive noise: consists of short-duration pulses of random 
amplitude and random duration. 

(f) Transient noise pulses: consists of relatively long duration noise 
pulses. 



2.2 White Noise 



White noise is defined as an uncorrelated noise process with equal power at
all frequencies (Figure 2.1). A noise that has the same power at all
frequencies in the range of $\pm\infty$ would necessarily need to have infinite
power, and is therefore only a theoretical concept. However a band-limited
noise process, with a flat spectrum covering the frequency range of a
band-limited communication system, is to all intents and purposes from the
point of view of the system a white noise process. For example, for an audio
system with a bandwidth of 10 kHz, any flat-spectrum audio noise with a
bandwidth greater than 10 kHz looks like a white noise.




Figure 2.1 Illustration of (a) white noise, (b) its autocorrelation, and
(c) its power spectrum.

The autocorrelation function of a continuous-time zero-mean white noise
process with a variance of $\sigma^2$ is a delta function given by

$$ r_{NN}(\tau) = E[N(t)N(t+\tau)] = \sigma^2 \delta(\tau) \tag{2.1} $$

The power spectrum of a white noise, obtained by taking the Fourier 
transform of Equation (2.1), is given by 



$$ P_{NN}(f) = \int_{-\infty}^{\infty} r_{NN}(\tau)\,e^{-j2\pi f\tau}\,d\tau = \sigma^2 \tag{2.2} $$



Equation (2.2) shows that a white noise has a constant power spectrum. 

A pure white noise is a theoretical concept, since it would need to have 
infinite power to cover an infinite range of frequencies. Furthermore, a 
discrete-time signal by necessity has to be band-limited, with its highest 
frequency less than half the sampling rate. A more practical concept is band- 
limited white noise, defined as a noise with a flat spectrum in a limited 
bandwidth. The spectrum of band-limited white noise with a bandwidth of B 
Hz is given by 



$$ P_{NN}(f) = \begin{cases} \sigma^2 & |f| \le B \\ 0 & \text{otherwise} \end{cases} \tag{2.3} $$



Thus the total power of a band-limited white noise process is $2B\sigma^2$. The
autocorrelation function of a discrete-time band-limited white noise process
is given by

$$ r_{NN}(T_s k) = 2B\sigma^2\,\frac{\sin(2\pi B T_s k)}{2\pi B T_s k} \tag{2.4} $$

where $T_s$ is the sampling period. For convenience of notation $T_s$ is usually
assumed to be unity. For the case when $T_s = 1/2B$, i.e. when the sampling rate
is equal to the Nyquist rate, Equation (2.4) becomes

$$ r_{NN}(T_s k) = 2B\sigma^2\,\frac{\sin(\pi k)}{\pi k} = 2B\sigma^2\,\delta(k) \tag{2.5} $$



In Equation (2.5) the autocorrelation function is a delta function. 
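
Equations (2.4) and (2.5) can be evaluated directly to see how the sampling rate affects the correlation between samples. The Python sketch below is a numerical check under assumed values (a 4 kHz bandwidth and unit variance); the function name is hypothetical.

```python
import math


def bandlimited_white_autocorr(k, bandwidth, sigma2, ts):
    """Equation (2.4): autocorrelation at lag k of band-limited white
    noise of bandwidth B and spectral level sigma2, sampled every ts."""
    if k == 0:
        return 2.0 * bandwidth * sigma2  # limit of sin(x)/x as x -> 0
    arg = 2.0 * math.pi * bandwidth * ts * k
    return 2.0 * bandwidth * sigma2 * math.sin(arg) / arg


B = 4000.0       # assumed bandwidth in Hz
SIGMA2 = 1.0     # assumed spectral level

for label, ts in (("Nyquist rate", 1.0 / (2.0 * B)),
                  ("twice Nyquist", 1.0 / (4.0 * B))):
    lags = [bandlimited_white_autocorr(k, B, SIGMA2, ts) for k in range(4)]
    print(label, [round(value, 3) for value in lags])
```

At the Nyquist rate every non-zero lag vanishes to within rounding error, so the samples are uncorrelated. When the same noise is sampled faster than the Nyquist rate, neighbouring samples become correlated even though the underlying spectrum is still flat within the band.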
2.3 Coloured Noise

Although the concept of white noise provides a reasonably realistic and
mathematically convenient and useful approximation to some predominant
noise processes encountered in telecommunication systems, many other
noise processes are non-white. The term coloured noise refers to any
broadband noise with a non-white spectrum. For example most
audio-frequency noise, such as the noise from moving cars, noise from computer
fans, electric drill noise and people talking in the background, has a
non-white predominantly low-frequency spectrum. Also, a white noise passing
through a channel is “coloured” by the shape of the channel spectrum. Two
classic varieties of coloured noise are so-called pink noise and brown noise,
shown in Figures 2.2 and 2.3.
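
The colouring of white noise by a channel can be illustrated with a short simulation. In the Python sketch below, the "channel" is a hypothetical first-order recursive low-pass filter, y(m) = a y(m-1) + x(m), chosen only because it is simple; it is not a model of pink or brown noise specifically. The function names, the coefficient a = 0.9 and the random seed are assumptions. The lag-one normalised autocorrelation is used as a quick indicator: it is close to zero for white noise and close to a for the filtered noise.

```python
import random


def white_noise(num_samples, sigma=1.0, seed=0):
    """Zero-mean Gaussian white noise from a seeded generator."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, sigma) for _ in range(num_samples)]


def colour(noise, a=0.9):
    """Pass noise through y(m) = a * y(m-1) + x(m), an assumed
    first-order low-pass channel."""
    output, previous = [], 0.0
    for sample in noise:
        previous = a * previous + sample
        output.append(previous)
    return output


def normalised_autocorr(x, lag):
    """Sample autocorrelation at the given lag, divided by lag zero."""
    mean = sum(x) / len(x)
    centred = [value - mean for value in x]
    numerator = sum(centred[i] * centred[i + lag]
                    for i in range(len(centred) - lag))
    return numerator / sum(value * value for value in centred)


white = white_noise(20000)
coloured = colour(white, a=0.9)
print("lag-1 correlation, white   :", round(normalised_autocorr(white, 1), 3))
print("lag-1 correlation, coloured:", round(normalised_autocorr(coloured, 1), 3))
```

The strong correlation between adjacent samples of the filtered noise corresponds to a spectrum concentrated at low frequencies, which is the behaviour described above for most audio-frequency noise.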





Figure 2.2 (a) A pink noise signal and (b) its magnitude spectrum. 





Figure 2.3 (a) A brown noise signal and (b) its magnitude spectrum.

2.4 Impulsive Noise

Impulsive noise consists of short-duration “on/off” noise pulses, caused by a
variety of sources, such as switching noise, adverse channel environment in 
a communication system, drop-outs or surface degradation of audio 
recordings, clicks from computer keyboards, etc. Figure 2.4(a) shows an 
ideal impulse and its frequency spectrum. In communication systems, a real 
impulsive-type noise has a duration that is normally more than one sample 
long. For example, in the context of audio signals, short-duration, sharp 
pulses, of up to 3 milliseconds (60 samples at a 20 kHz sampling rate) may 
be considered as impulsive noise. Figures 2.4(b) and (c) illustrate two 
examples of short-duration pulses and their respective spectra. 

In a communication system, an impulsive noise originates at some point 
in time and space, and then propagates through the channel to the receiver. 
The received noise is time-dispersed and shaped by the channel, and can be 
considered as the channel impulse response. In general, the characteristics of 
a communication channel may be linear or non-linear, stationary or time 
varying. Furthermore, many communication systems, in response to a
large-amplitude impulse, exhibit a non-linear characteristic.

Figure 2.4 Time and frequency sketches of: (a) an ideal impulse, (b) and (c)
short-duration pulses.

Figure 2.5 Illustration of variations of the impulse response of a non-linear system
with the increasing amplitude of the impulse.

Figure 2.5 illustrates some examples of impulsive noise, typical of 
those observed on an old gramophone recording. In this case, the 
communication channel is the playback system, and may be assumed to be 
time-invariant. The figure also shows some variations of the channel 
characteristics with the amplitude of impulsive noise. For example, in 
Figure 2.5(c) a large impulse excitation has generated a decaying transient 
pulse. These variations may be attributed to the non-linear characteristics of 
the playback mechanism. 



2.5 Transient Noise Pulses 

Transient noise pulses often consist of a relatively short sharp initial pulse
followed by decaying low-frequency oscillations as shown in Figure 2.6.
The initial pulse is usually due to some external or internal impulsive
interference, whereas the oscillations are often due to the resonance of the
communication channel excited by the initial pulse, and may be considered
as the response of the channel to the initial pulse. In a telecommunication
system, a noise pulse originates at some point in time and space, and then
propagates through the channel to the receiver. The noise pulse is shaped
by the channel characteristics, and may be considered as the channel pulse
response. Thus we should be able to characterise the transient noise pulses
with a similar degree of consistency as in characterising the channels
through which the pulses propagate.

Figure 2.6 (a) A scratch pulse and music from a gramophone record, (b) The
averaged profile of a gramophone record scratch pulse.

As an illustration of the shape of a transient noise pulse, consider the 
scratch pulses from a damaged gramophone record shown in Figures 2.6(a) 
and (b). Scratch noise pulses are acoustic manifestations of the response of 
the stylus and the associated electro-mechanical playback system to a sharp 
physical discontinuity on the recording medium. Since scratches are
essentially the impulse response of the playback mechanism, it is expected
that for a given system, various scratch pulses exhibit similar
characteristics. As shown in Figure 2.6(b), a typical scratch pulse waveform
often exhibits two distinct regions:

(a) the initial high-amplitude pulse response of the playback system to
the physical discontinuity on the record medium, followed by

(b) decaying oscillations that cause additive distortion.

The initial pulse is relatively short and has a duration on the order of
1-5 ms, whereas the oscillatory tail has a longer duration and may last up
to 50 ms or more.

Note in Figure 2.6(b) that the frequency of the decaying oscillations
decreases with time. This behaviour may be attributed to the non-linear
modes of response of the electro-mechanical playback system excited by the
physical scratch discontinuity. Observations of many scratch waveforms
from damaged gramophone records reveal that they have a well-defined
profile, and can be characterised by a relatively small number of typical
templates. Scratch pulse modelling and removal is considered in detail in
Chapter 13.
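The two-region shape described above can be imitated with a purely synthetic waveform. The following Python sketch is an illustration only: the half-sine initial pulse, the exponential envelope, the linear fall in oscillation frequency and every numerical parameter are assumptions chosen to match the durations quoted in the text, and are not fitted to any measured scratch pulse.

```python
import math

def scratch_pulse(fs=44100, pulse_ms=2.0, tail_ms=50.0,
                  f_start=800.0, f_end=200.0, decay_ms=12.0):
    """Return samples of a synthetic scratch-like pulse (all parameters assumed)."""
    n_pulse = int(fs * pulse_ms / 1000)
    n_tail = int(fs * tail_ms / 1000)
    samples = []
    # Region (a): short, high-amplitude initial pulse (half-sine shape assumed).
    for n in range(n_pulse):
        samples.append(math.sin(math.pi * n / n_pulse))
    # Region (b): decaying oscillation whose frequency falls from f_start to f_end.
    phase = 0.0
    for n in range(n_tail):
        freq = f_start + (f_end - f_start) * n / n_tail
        phase += 2 * math.pi * freq / fs
        envelope = 0.5 * math.exp(-(n / fs) / (decay_ms / 1000))
        samples.append(envelope * math.sin(phase))
    return samples

pulse = scratch_pulse()
print(len(pulse), "samples,", round(1000 * len(pulse) / 44100, 1), "ms")
```

Such a template could serve as a crude test input for a detector; a realistic template would be obtained by averaging recorded pulses, as in Figure 2.6(b).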



2.6 Thermal Noise 

Thermal noise, also referred to as Johnson noise (after its discoverer J. B.
Johnson), is generated by the random movements of thermally energised
particles. The concept of thermal noise has its roots in thermodynamics and
is associated with the temperature-dependent random movements of free
particles such as gas molecules in a container or electrons in a conductor.
Although these random particle movements average to zero, the fluctuations
about the average constitute the thermal noise. For example, the random
movements and collisions of gas molecules in a confined space produce
random fluctuations about the average pressure. As the temperature
increases, the kinetic energy of the molecules and the thermal noise
increase.

Similarly, an electrical conductor contains a very large number of free
electrons, together with ions that vibrate randomly about their equilibrium
positions and resist the movement of the electrons. The free movement of
electrons constitutes random spontaneous currents, or thermal noise, that
average to zero since in the absence of a voltage electrons move in all
different directions. As the temperature of a conductor, provided by its
surroundings, increases, the electrons move to higher-energy states and the
random current flow increases. For a metallic resistor, the mean square
value of the instantaneous voltage due to the thermal noise is given by

$$\overline{v^2} = 4kTRB \qquad (2.6)$$



where $k = 1.38 \times 10^{-23}$ joules per degree Kelvin is the Boltzmann
constant, T is the absolute temperature in degrees Kelvin, R is the
resistance in ohms and B is the bandwidth in hertz. From Equation (2.6) and
the preceding argument, a metallic resistor sitting on a table can be
considered as a generator of thermal noise power, with a mean square voltage
$\overline{v^2}$ and an internal resistance R. From circuit theory, the
maximum available power delivered by a “thermal noise generator”, dissipated
in a matched load of resistance R, is given by

$$P_N = i^2 R = \left(\frac{v_{rms}}{2R}\right)^2 R = \frac{\overline{v^2}}{4R} = kTB \quad \text{(W)} \qquad (2.7)$$



where $v_{rms}$ is the root mean square voltage. The spectral density of
thermal noise is given by

$$P_N(f) = \frac{kT}{2} \quad \text{(W/Hz)} \qquad (2.8)$$

From Equation (2.8), the thermal noise spectral density has a flat shape, i.e.
thermal noise is a white noise. Equation (2.8) holds well up to very high
radio frequencies of $10^{13}$ Hz.
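As a numerical illustration of Equations (2.6) and (2.7), the following Python sketch evaluates the rms noise voltage and the maximum available noise power of a resistor. The function names and the example values (a 1 kΩ resistor at 290 K observed over a 20 kHz bandwidth) are assumptions chosen for illustration only.

```python
import math

BOLTZMANN = 1.38e-23  # joules per kelvin, the value quoted in the text

def thermal_noise_rms_voltage(temperature_k, resistance_ohm, bandwidth_hz):
    """RMS thermal noise voltage, i.e. the square root of Equation (2.6)."""
    return math.sqrt(4 * BOLTZMANN * temperature_k * resistance_ohm * bandwidth_hz)

def thermal_noise_available_power(temperature_k, bandwidth_hz):
    """Maximum available thermal noise power kTB of Equation (2.7)."""
    return BOLTZMANN * temperature_k * bandwidth_hz

# Assumed example: 1 kilohm resistor at 290 K, 20 kHz bandwidth.
v_rms = thermal_noise_rms_voltage(290.0, 1e3, 20e3)
p_n = thermal_noise_available_power(290.0, 20e3)
print(f"v_rms = {v_rms * 1e6:.2f} microvolts, P_N = {p_n:.3e} W")
```

Note that the available power does not depend on R: a larger resistance produces a larger noise voltage, but delivers it through a proportionally larger internal resistance.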
2.7 Shot Noise

The term shot noise arose from the analysis of random variations in the
emission of electrons from the cathode of a vacuum tube. Discrete electron
particles in a current flow arrive at random times, and therefore there will be
fluctuations about the average particle flow. The fluctuations in the rate of
particle flow constitute the shot noise. Other instances of shot noise are the
flow of photons in a laser beam, the flow and recombination of electrons and
holes in semiconductors, and the flow of photoelectrons emitted in
photodiodes. The concept of randomness of the rate of emission or arrival of
particles implies that shot noise can be modelled by a Poisson distribution.
When the average number of arrivals during the observing time is large, the
fluctuations will approach a Gaussian distribution. Note that whereas
thermal noise is due to “unforced” random movement of particles, shot noise
happens in a forced directional flow of particles.

Now consider an electric current as the flow of discrete electric charges.
If the charges act independently of each other, the fluctuating current is
given by

$$I_{Noise(rms)} = \left(2 e I_{dc} B\right)^{1/2} \qquad (2.9)$$

where $e = 1.6 \times 10^{-19}$ coulomb is the electron charge, $I_{dc}$ is
the steady current and B is the measurement bandwidth. For example, a
“steady” current $I_{dc}$ of 1 amp in a bandwidth of 1 MHz has an rms
fluctuation of 0.57 microamps. Equation (2.9) assumes that the charge carriers
making up the current act independently. That is the case for charges
crossing a barrier, as for example the current in a junction diode, where the
charges move by diffusion; but it is not true for metallic conductors, where
there are long-range correlations between charge carriers.
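The numerical example in the text can be checked directly from Equation (2.9). The short Python sketch below does so; the function name is an assumption introduced for illustration.

```python
import math

ELECTRON_CHARGE = 1.6e-19  # coulomb, the value quoted in the text

def shot_noise_rms_current(dc_current_a, bandwidth_hz):
    """RMS shot noise current of Equation (2.9): sqrt(2 * e * I_dc * B)."""
    return math.sqrt(2 * ELECTRON_CHARGE * dc_current_a * bandwidth_hz)

# The example from the text: 1 A steady current in a 1 MHz bandwidth.
i_rms = shot_noise_rms_current(1.0, 1e6)
print(f"{i_rms * 1e6:.2f} microamps")  # prints 0.57 microamps
```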



2.8 Electromagnetic Noise 

Virtually every electrical device that generates, consumes or transmits 
power is a potential source of electromagnetic noise and interference for 
other systems. In general, the higher the voltage or the current level, and the 
closer the proximity of electrical circuits/devices, the greater will be the 
induced noise. The common sources of electromagnetic noise are 
transformers, radio and television transmitters, mobile phones, microwave 
transmitters, ac power lines, motors and motor starters, generators, relays, 
oscillators, fluorescent lamps, and electrical storms.

Electrical noise from these sources can be categorized into two basic
types: electrostatic and magnetic. These two types of noise are 
fundamentally different, and thus require different noise-shielding measures. 
Unfortunately, most of the common noise sources listed above produce 
combinations of the two noise types, which can complicate the noise 
reduction problem. 

Electrostatic fields are generated by the presence of voltage, with or 
without current flow. Fluorescent lighting is one of the more common 
sources of electrostatic noise. Magnetic fields are created either by the flow 
of electric current or by the presence of permanent magnetism. Motors and 
transformers are examples of the former, and the Earth's magnetic field is an 
instance of the latter. In order for noise voltage to be developed in a 
conductor, magnetic lines of flux must be cut by the conductor. Electric 
generators function on this basic principle. In the presence of an alternating 
field, such as that surrounding a 50/60 Hz power line, voltage will be 
induced into any stationary conductor as the magnetic field expands and 
collapses. Similarly, a conductor moving through the Earth's magnetic field 
has a noise voltage generated in it as it cuts the lines of flux. 



2.9 Channel Distortions 

On propagating through a channel, signals are shaped and distorted by the
frequency response and the attenuating characteristics of the channel. There
are two main manifestations of channel distortions: magnitude distortion
and phase distortion. In addition, in radio communication, we have the
multi-path effect, in which the transmitted signal may take several different
routes to the receiver, with the effect that multiple versions of the signal
with different delay and attenuation arrive at the receiver. Channel
distortions can degrade or even severely disrupt a communication process,
and hence channel modelling and equalization are essential components of
modern digital communication systems. Channel equalization is particularly
important in modern cellular communication systems, since the variations of
channel characteristics and propagation attenuation in cellular radio systems
are far greater than those of the landline systems. Figure 2.7 illustrates the
frequency response of a channel with one invertible and two non-invertible
regions. In the non-invertible regions, the signal frequencies are heavily
attenuated and lost to the channel noise. In the invertible region, the signal is
distorted but recoverable. This example illustrates that the channel inverse
filter must be implemented with care in order to avoid undesirable results
such as noise amplification at frequencies with a low SNR. Channel
equalization is covered in detail in Chapter 15.

Figure 2.7 Illustration of channel distortion: (a) the input signal spectrum,
(b) the channel frequency response, (c) the channel output.
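The danger of a plain inverse filter can be seen with a few numbers. The Python sketch below compares the gain 1/H(f) with a regularised, Wiener-type gain H/(H² + N/S) for a real, positive channel gain H at a single frequency. This regularised form is one common choice and is used here as an assumption; it is not necessarily the equalizer developed later in the book, and the noise-to-signal ratio of 0.01 is an arbitrary illustrative value.

```python
def plain_inverse_gain(h):
    """Gain of the plain inverse filter 1/H for a real, positive channel gain."""
    return 1.0 / h

def regularised_inverse_gain(h, noise_to_signal=0.01):
    """Wiener-type gain H/(H^2 + N/S): near 1/H at high SNR, near 0 at low SNR."""
    return h / (h * h + noise_to_signal)

# Channel gains from a well-conditioned region down to a heavily attenuated one.
channel_gains = [1.0, 0.5, 0.05, 0.001]
for h in channel_gains:
    print(h, round(plain_inverse_gain(h), 1), round(regularised_inverse_gain(h), 3))
```

In the heavily attenuated region the plain inverse applies a gain of about 1000 to whatever noise is present, whereas the regularised gain falls towards zero there and stays close to 1/H where the channel is invertible.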



2.10 Modelling Noise 

The objective of modelling is to characterise the structures and the patterns 
in a signal or a noise process. To model a noise accurately, we need a 
structure for modelling both the temporal and the spectral characteristics of 
the noise. Accurate modelling of noise statistics is the key to high-quality 
noisy signal classification and enhancement. Even the seemingly simple task 
of signal/noise classification is crucially dependent on the availability of 
good signal and noise models, and on the use of these models within a 
Bayesian framework. Hidden Markov models, described in Chapter 5, are a
good structure for modelling signals or noise.

Figure 2.8 Illustration of: (a) the time-waveform of a drill noise, and (b) the
frequency spectrum of the drill noise.

Figure 2.9 Power spectra of car noise in (a) a BMW at 70 mph, and
(b) a Volvo at 70 mph.

One of the most useful and indispensable tools for gaining insight into
the structure of a noise process is the use of the Fourier transform for
frequency analysis. Figure 2.8 illustrates the noise from an electric drill,
which, as expected, has a periodic structure. The spectrum of the drilling
noise shown in Figure 2.8(b) reveals that most of the noise energy is
concentrated in the lower-frequency part of the spectrum. In fact, it is true
of most audio signals and noise that they have a predominantly low-frequency
spectrum. However, it must be noted that the relatively lower-energy
high-frequency part of audio signals plays an important part in conveying
sensation and quality. Figures 2.9(a) and (b) show examples of the spectra of
car noise recorded from a BMW and a Volvo respectively. The noise in a car is
non-stationary and varied, and may include the following sources:

(a) quasi-periodic noise from the car engine and the revolving mechanical 
parts of the car; 

(b) noise from the surface contact of wheels and the road surface; 

(c) noise from the air flow into the car through the air ducts, windows,
sunroof, etc;

(d) noise from passing/overtaking vehicles.

The characteristics of car noise vary with the speed, the road surface
conditions, the weather, and the environment within the car.

The simplest method for noise modelling, often used in current practice, 
is to estimate the noise statistics from the signal-inactive periods. In optimal 
Bayesian signal processing methods, a set of probability models are trained 
for the signal and the noise processes. The models are then used for the 
decoding of the underlying states of the signal and noise, and for noisy 
signal recognition and enhancement. 
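The simplest method mentioned above, estimating the noise statistics from the signal-inactive periods, can be sketched in a few lines of Python. The sketch assumes that a per-sample activity flag is already available (for example from a signal activity detector, which is not shown); the function name and the toy data are hypothetical.

```python
def noise_stats_from_inactive(samples, active_flags):
    """Estimate the noise mean and variance from samples flagged as signal-inactive."""
    noise = [x for x, active in zip(samples, active_flags) if not active]
    if not noise:
        raise ValueError("no signal-inactive samples available")
    mean = sum(noise) / len(noise)
    variance = sum((x - mean) ** 2 for x in noise) / len(noise)
    return mean, variance

# Toy observation: low-level noise with a burst of signal in the middle.
samples = [0.1, -0.1, 0.1, -0.1, 2.0, -1.8, 0.1, -0.1]
flags = [False, False, False, False, True, True, False, False]
noise_mean, noise_variance = noise_stats_from_inactive(samples, flags)
print(noise_mean, noise_variance)
```

The estimate is only as good as the activity flags, and it implicitly assumes that the noise statistics do not change between the inactive periods and the active ones.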
2.10.1 Additive White Gaussian Noise Model (AWGN)

In communication theory, it is often assumed that the noise is a stationary
additive white Gaussian noise (AWGN) process. Although for some problems this
is a valid assumption and leads to mathematically convenient and useful
solutions, in practice the noise is often time-varying, correlated and
non-Gaussian. This is particularly true for impulsive-type noise and for
acoustic noise, which are non-stationary and non-Gaussian and hence cannot be
modelled using the AWGN assumption. Non-stationary and non-Gaussian
noise processes can be modelled by a Markovian chain of stationary
sub-processes as described briefly in the next section and in detail in
Chapter 5.



2.10.2 Hidden Markov Model for Noise 

Most noise processes are non-stationary; that is, the statistical parameters
of the noise, such as its mean, variance and power spectrum, vary with time.
Non-stationary processes may be modelled using the hidden Markov models
(HMMs) described in detail in Chapter 5. An HMM is essentially a
finite-state Markov chain of stationary subprocesses. The implicit assumption
in using HMMs for noise is that the noise statistics can be modelled by a
Markovian chain of stationary subprocesses. Note that a stationary noise
process can be modelled by a single-state HMM. For a non-stationary noise,
a multistate HMM can model the time variations of the noise process with a
finite number of stationary states. For non-Gaussian noise, a mixture
Gaussian density model can be used to model the space of the noise within
each state. In general, the number of states per model and the number of
mixtures per state required to accurately model a noise process depend on
the non-stationary character of the noise.

Figure 2.10 (a) An impulsive noise sequence, (b) A binary-state model of
impulsive noise.

An example of a non-stationary noise is the impulsive noise of Figure
2.10(a). Figure 2.10(b) shows a two-state HMM of the impulsive noise
sequence: the state $S_0$ models the “impulse-off” periods between the
impulses, and state $S_1$ models an impulse. In those cases where each impulse
has a well-defined temporal structure, it may be beneficial to use a
multi-state HMM to model the pulse itself. HMMs are used in Chapter 11 for
modelling impulsive noise, and in Chapter 14 for channel equalisation.
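The binary-state model of Figure 2.10(b) can be turned into a simple noise generator. The Python sketch below is an illustration under stated assumptions: the transition probabilities, the zero output in the impulse-off state and the Gaussian impulse amplitudes are all hypothetical choices, since the text does not specify them.

```python
import random

def impulsive_noise(n_samples, p_off_to_on=0.02, p_on_to_off=0.7,
                    impulse_std=1.0, seed=0):
    """Generate impulsive noise from a two-state Markov chain (assumed parameters)."""
    rng = random.Random(seed)
    state = 0  # 0: impulse-off state S0, 1: impulse state S1
    states, noise = [], []
    for _ in range(n_samples):
        # State transition according to the Markov chain.
        if state == 0:
            state = 1 if rng.random() < p_off_to_on else 0
        else:
            state = 0 if rng.random() < p_on_to_off else 1
        states.append(state)
        # State-dependent output: silence in S0, a random amplitude in S1.
        noise.append(rng.gauss(0.0, impulse_std) if state == 1 else 0.0)
    return states, noise

states, noise = impulsive_noise(20000)
print("fraction of samples in the impulse state:", sum(states) / len(states))
```

With these transition probabilities the chain spends roughly 0.02/(0.02 + 0.7), or about 3%, of the time in the impulse state, which gives the sparse, bursty character of Figure 2.10(a).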
The small probability of collision of the Earth and a comet can become very
great in adding over a long sequence of centuries. It is easy to picture the
effects of this impact on the Earth. The axis and the motion of rotation have
changed, the seas abandoning their old position...

Pierre-Simon Laplace

3.1 Random Signals and Stochastic Processes

3.2 Probabilistic Models

3.3 Stationary and Non-stationary Processes

3.4 Expected Values of a Process

3.5 Some Useful Classes of Random Processes

3.6 Transformation of a Random Process

3.7 Summary

Probability models form the foundation of information theory.
Information itself is quantified in terms of the logarithm of 
probability. Probability models are used to characterise and predict 
the occurrence of random events in such diverse areas of applications as 
predicting the number of telephone calls on a trunk line in a specified period 
of the day, road traffic modelling, weather forecasting, financial data 
modelling, predicting the effect of drugs given data from medical trials, etc. 
In signal processing, probability models are used to describe the variations 
of random signals in applications such as pattern recognition, signal coding 
and signal estimation. This chapter begins with a study of the basic concepts 
of random signals and stochastic processes and the models that are used for 
the characterisation of random processes. Stochastic processes are classes of 
signals whose fluctuations in time are partially or completely random, such 
as speech, music, image, time-varying channels, noise and video. Stochastic 
signals are completely described in terms of a probability model, but can 
also be characterised with relatively simple statistics, such as the mean, the 
correlation and the power spectrum. We study the concept of ergodic 
stationary processes in which time averages obtained from a single 
realisation of a process can be used instead of ensemble averages. We 
consider some useful and widely used classes of random signals, and study 
the effect of filtering or transformation of a signal on its probability 
distribution. 








3.1 Random Signals and Stochastic Processes 

Signals, in terms of one of their most fundamental characteristics, can be 
classified into two broad categories: deterministic signals and random 
signals. Random functions of time are often referred to as stochastic signals. 
In each class, a signal may be continuous or discrete in time, and may have
continuous-valued or discrete-valued amplitudes.

A deterministic signal can be defined as one that traverses a 
predetermined trajectory in time and space. The exact fluctuations of a 
deterministic signal can be completely described in terms of a function of 
time, and the exact value of the signal at any time is predictable from the 
functional description and the past history of the signal. For example, a sine
wave x(t) can be modelled, and accurately predicted, either by a second-order
linear predictive model or by the more familiar equation
$x(t) = A\sin(2\pi f t + \varphi)$.

Random signals have unpredictable fluctuations; hence it is not possible 
to formulate an equation that can predict the exact future value of a random 
signal from its past history. Most signals such as speech and noise are at 
least in part random. The concept of randomness is closely associated with 
the concepts of information and noise. Indeed, much of the work on the 
processing of random signals is concerned with the extraction of 
information from noisy observations. If a signal is to have a capacity to 
convey information, it must have a degree of randomness: a predictable 
signal conveys no information. Therefore the random part of a signal is 
either the information content of the signal, or noise, or a mixture of both 
information and noise. Although a random signal is not completely 
predictable, it often exhibits a set of well-defined statistical characteristic 
values such as the maximum, the minimum, the mean, the median, the 
variance and the power spectrum. A random process is described in terms of 
its statistics, and most completely in terms of a probability model from 
which all its statistics can be calculated. 

Example 3.1 Figure 3.1(a) shows a block diagram model of a
deterministic discrete-time signal. The model generates an output signal
x(m) from the P past samples as

$$x(m) = h_1\big(x(m-1), x(m-2), \ldots, x(m-P)\big) \qquad (3.1)$$

where the function $h_1$ may be a linear or a non-linear model. A functional
description of the model $h_1$ and the P initial sample values are all that is
required to predict the future values of the signal x(m). For example, for a
sinusoidal signal generator (or oscillator) Equation (3.1) becomes

$$x(m) = a\,x(m-1) - x(m-2) \qquad (3.2)$$

where the choice of the parameter $a = 2\cos(2\pi F_0/F_s)$ determines the
oscillation frequency $F_0$ of the sinusoid, at a sampling frequency of $F_s$.
Figure 3.1(b) is a model for a stochastic random process given by

$$x(m) = h_2\big(x(m-1), x(m-2), \ldots, x(m-P)\big) + e(m) \qquad (3.3)$$

where the random input e(m) models the unpredictable part of the signal
x(m), and the function $h_2$ models the part of the signal that is correlated
with the past samples. For example, a narrowband, second-order
autoregressive process can be modelled as

$$x(m) = a_1 x(m-1) + a_2 x(m-2) + e(m) \qquad (3.4)$$

where the choice of the parameters $a_1$ and $a_2$ will determine the centre
frequency and the bandwidth of the process.

Figure 3.1 Illustration of deterministic and stochastic signal models: (a) a
deterministic signal model, (b) a stochastic signal model.
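The deterministic and stochastic models of Example 3.1 can be compared with a short Python sketch. The oscillator follows Equation (3.2) directly. For the autoregressive model of Equation (3.4), the sketch assumes the usual pole parameterisation $a_1 = 2r\cos(2\pi F_0/F_s)$ and $a_2 = -r^2$, with a pole radius r just below one; this parameterisation, the function names and the numerical values are assumptions introduced here, not taken from the text.

```python
import math
import random

def sinusoid_by_recursion(f0, fs, n_samples):
    """Generate sin(2*pi*f0*m/fs) with the recursion x(m) = a*x(m-1) - x(m-2)."""
    w = 2 * math.pi * f0 / fs
    a = 2 * math.cos(w)
    x_prev2 = math.sin(-2 * w)  # initial sample x(-2)
    x_prev1 = math.sin(-w)      # initial sample x(-1)
    out = []
    for _ in range(n_samples):
        x = a * x_prev1 - x_prev2
        out.append(x)
        x_prev2, x_prev1 = x_prev1, x
    return out

def ar2_process(f0, fs, pole_radius, n_samples, seed=0):
    """Narrowband AR(2) process x(m) = a1*x(m-1) + a2*x(m-2) + e(m)."""
    rng = random.Random(seed)
    a1 = 2 * pole_radius * math.cos(2 * math.pi * f0 / fs)  # sets the centre frequency
    a2 = -pole_radius ** 2                                  # sets the bandwidth
    x_prev1 = x_prev2 = 0.0
    out = []
    for _ in range(n_samples):
        x = a1 * x_prev1 + a2 * x_prev2 + rng.gauss(0.0, 1.0)
        out.append(x)
        x_prev2, x_prev1 = x_prev1, x
    return out

deterministic = sinusoid_by_recursion(50.0, 8000.0, 200)
stochastic = ar2_process(50.0, 8000.0, 0.99, 500)
```

Two initial samples fix the whole future of the oscillator output, whereas the autoregressive output depends on the random input e(m) at every step and can only be predicted in part.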
3.1.1 Stochastic Processes

The term “stochastic process” is broadly used to describe a random process
that generates sequential signals such as speech or noise. In signal
processing terminology, a stochastic process is a probability model of a class
of random signals, e.g. Gaussian process, Markov process, Poisson process,
etc. The classic example of a stochastic process is the so-called Brownian
motion of particles in a fluid. Particles in the space of a fluid move
randomly due to bombardment by fluid molecules. The random motion of
each particle is a single realisation of a stochastic process. The motion of all
particles in the fluid forms the collection or the space of different
realisations of the process.

In this chapter, we are mainly concerned with discrete-time random
processes that may occur naturally or may be obtained by sampling a
continuous-time band-limited random process. The term “discrete-time
stochastic process” refers to a class of discrete-time random signals, X(m),
characterised by a probabilistic model. Each realisation of a discrete
stochastic process X(m) may be indexed in time and space as x(m,s),
where m is the discrete time index, and s is an integer variable that
designates a space index to each realisation of the process.

3.1.2 The Space or Ensemble of a Random Process 

The collection of all realisations of a random process is known as the 
ensemble, or the space, of the process. For an illustration, consider a random 
noise process over a telecommunication network as shown in Figure 3.2. 
The noise on each telephone line fluctuates randomly with time, and may be 
denoted as n(m,s), where m is the discrete time index and s denotes the line 
index. The collection of noise on different lines forms the ensemble (or the
space) of the noise process denoted by N(m)={n(m,s)}, where n(m,s)
denotes a realisation of the noise process N(m) on the line s. The “true”
statistics of a random process are obtained from the averages taken over the 
ensemble of many different realisations of the process. However, in many 
practical cases, only one realisation of a process is available. In Section 3.4, 
we consider the so-called ergodic processes in which time-averaged 
statistics, from a single realisation of a process, may be used instead of the 
ensemble-averaged statistics. 

Notation The following notation is used in this chapter: X(m) denotes a
random process, the signal x(m,s) is a particular realisation of the process
X(m), the random signal x(m) is any realisation of X(m), and the collection
of all realisations of X(m), denoted by {x(m,s)}, forms the ensemble or the
space of the random process X(m).

Figure 3.2 Illustration of three realisations in the space of a random noise N(m).



3.2 Probabilistic Models 

Probability models provide the most complete mathematical description of a 
random process. For a fixed time instant m, the collection of sample
realisations of a random process {x(m,s)} is a random variable that takes on
various values across the space s of the process. The main difference
between a random variable and a random process is that the latter generates 
a time series. Therefore, the probability models used for random variables 
may also be applied to random processes. We start this section with the 
definitions of the probability functions for a random variable. 

Figure 3.3 A two-dimensional representation of the outcomes of two dice, and the
subspaces associated with the events corresponding to the sum of the dice being
greater than 8, or less than or equal to 8.

The space of a random variable is the collection of all the values, or
outcomes, that the variable can assume. The space of a random variable can
be partitioned, according to some criteria, into a number of subspaces. A
subspace is a collection of signal values with a common attribute, such as a
cluster of closely spaced samples, or the collection of samples with their
amplitude within a given band of values. Each subspace is called an event,
and the probability of an event A, P(A), is the ratio of the number of
observed outcomes from the space of A, $N_A$, to the total number of
observations:

$$P(A) = \frac{N_A}{\sum_{\text{All events } i} N_i} \qquad (3.5)$$



From Equation (3.5), it is evident that the sum of the probabilities of all 
likely events in an experiment is unity. 

Example 3.2 The space of two discrete numbers obtained as outcomes of 
throwing a pair of dice is shown in Figure 3.3. This space can be partitioned 
in different ways; for example, the two subspaces shown in Figure 3.3 are 
associated with the pair of numbers that add up to less than or equal to 8, 
and to greater than 8. In this example, assuming the dice are not loaded, all 
numbers are equally likely, and the probability of each event is proportional 
to the total number of outcomes in the space of the event. 
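The probabilities of the two events in Example 3.2 can be obtained by counting outcomes, as in Equation (3.5). The following Python sketch enumerates the 36 equally likely outcomes of two fair dice; the variable names are illustrative.

```python
from fractions import Fraction

# All 36 equally likely outcomes of throwing a pair of fair dice.
outcomes = [(d1, d2) for d1 in range(1, 7) for d2 in range(1, 7)]

# Number of outcomes in the subspace "sum greater than 8".
n_greater = sum(1 for d1, d2 in outcomes if d1 + d2 > 8)

p_greater = Fraction(n_greater, len(outcomes))
p_not_greater = Fraction(len(outcomes) - n_greater, len(outcomes))
print(p_greater, p_not_greater)  # prints 5/18 13/18
```

The two events partition the space, so their probabilities sum to one.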



3.2.1 Probability Mass Function (pmf) 

For a discrete random variable X that can only assume discrete values from a
finite set of N numbers $\{x_1, x_2, \ldots, x_N\}$, each outcome $x_i$ may be
considered as an event and assigned a probability of occurrence. The
probability that a discrete-valued random variable X takes on a value of
$x_i$, $P(X = x_i)$, is called the probability mass function (pmf). For two
such random variables X and Y, the probability of an outcome in which X takes
on a value of $x_i$ and Y takes on a value of $y_j$, $P(X = x_i, Y = y_j)$, is
called the joint probability mass function. The joint pmf can be described in
terms of the conditional and the marginal probability mass functions as

$$P_{X,Y}(x_i, y_j) = P_{Y|X}(y_j \mid x_i)\,P_X(x_i) = P_{X|Y}(x_i \mid y_j)\,P_Y(y_j) \qquad (3.6)$$

where $P_{Y|X}(y_j \mid x_i)$ is the probability of the random variable Y
taking on a value of $y_j$ conditioned on X having taken a value of $x_i$, and
the so-called marginal pmf of X is obtained as

$$P_X(x_i) = \sum_{j=1}^{M} P_{X,Y}(x_i, y_j) = \sum_{j=1}^{M} P_{X|Y}(x_i \mid y_j)\,P_Y(y_j) \qquad (3.7)$$

where M is the number of values, or outcomes, in the space of the discrete
random variable Y. From Equations (3.6) and (3.7), we have Bayes’ rule for
the conditional probability mass function, given by

$$P_{X|Y}(x_i \mid y_j) = \frac{1}{P_Y(y_j)}\,P_{Y|X}(y_j \mid x_i)\,P_X(x_i) = \frac{P_{Y|X}(y_j \mid x_i)\,P_X(x_i)}{\sum_{k=1}^{N} P_{Y|X}(y_j \mid x_k)\,P_X(x_k)} \qquad (3.8)$$
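Equations (3.6) to (3.8) can be checked on a small joint pmf. The Python sketch below uses a hypothetical joint pmf of two binary variables; the table values and the function names are assumptions made for illustration, and exact fractions are used so that the identities hold without rounding error.

```python
from fractions import Fraction as F

# Hypothetical joint pmf P_{X,Y}(x, y) for X in {0, 1} and Y in {0, 1}.
joint = {(0, 0): F(3, 10), (0, 1): F(1, 10),
         (1, 0): F(1, 5),  (1, 1): F(2, 5)}

def marginal_x(x):
    """Marginal pmf of X, Equation (3.7): sum the joint pmf over all y."""
    return sum(p for (xi, _), p in joint.items() if xi == x)

def marginal_y(y):
    """Marginal pmf of Y: sum the joint pmf over all x."""
    return sum(p for (_, yj), p in joint.items() if yj == y)

def cond_y_given_x(y, x):
    """Conditional pmf P_{Y|X}(y|x), from Equation (3.6)."""
    return joint[(x, y)] / marginal_x(x)

def bayes_x_given_y(x, y):
    """Conditional pmf P_{X|Y}(x|y) computed with Bayes' rule, Equation (3.8)."""
    x_values = sorted({xi for xi, _ in joint})
    numerator = cond_y_given_x(y, x) * marginal_x(x)
    denominator = sum(cond_y_given_x(y, xk) * marginal_x(xk) for xk in x_values)
    return numerator / denominator

print(bayes_x_given_y(1, 1))  # prints 4/5
```

The same value follows directly from the definition, $P_{X,Y}(1,1)/P_Y(1) = (2/5)/(1/2) = 4/5$, which confirms that the denominator of Equation (3.8) is simply the marginal pmf of Y.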



3.2.2 Probability Density Function (pdf) 

Now consider a continuous-valued random variable. A continuous-valued
variable can assume an infinite number of values, and hence the probability
that it takes on any given value vanishes to zero. For a continuous-valued
random variable X the cumulative distribution function (cdf) is defined as
the probability that the outcome is less than or equal to x:

$$F_X(x) = \mathrm{Prob}(X \le x) \qquad (3.9)$$

where Prob(·) denotes probability. The probability that a random variable X
takes on a value within a band of width $\Delta$ centred on x, normalised by
$\Delta$, can be expressed as

$$\frac{1}{\Delta}\mathrm{Prob}(x - \Delta/2 < X \le x + \Delta/2) = \frac{1}{\Delta}\left[\mathrm{Prob}(X \le x + \Delta/2) - \mathrm{Prob}(X \le x - \Delta/2)\right] = \frac{1}{\Delta}\left[F_X(x + \Delta/2) - F_X(x - \Delta/2)\right] \qquad (3.10)$$

As $\Delta$ tends to zero we obtain the probability density function (pdf) as

$$f_X(x) = \lim_{\Delta \to 0} \frac{1}{\Delta}\left[F_X(x + \Delta/2) - F_X(x - \Delta/2)\right] = \frac{\partial F_X(x)}{\partial x} \qquad (3.11)$$

Since $F_X(x)$ is a non-decreasing function of x, the pdf of x, which is the
rate of change of $F_X(x)$ with x, is a non-negative-valued function; i.e.
$f_X(x) \ge 0$. The integral of the pdf of a random variable X in the range
$\pm\infty$ is unity:

$$\int_{-\infty}^{\infty} f_X(x)\,dx = 1 \qquad (3.12)$$

The conditional and marginal probability functions and the Bayes rule, of
Equations (3.6)-(3.8), also apply to probability density functions of
continuous-valued variables.
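The limit in Equation (3.11) and the unit-area property of Equation (3.12) can be illustrated numerically. The Python sketch below assumes a zero-mean, unit-variance Gaussian variable as the example (the text does not tie these equations to any particular distribution) and approximates the limit with a small but finite band $\Delta$.

```python
import math

def gaussian_cdf(x, mean=0.0, std=1.0):
    """Cumulative distribution function of a Gaussian random variable."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def pdf_from_cdf(cdf, x, delta=1e-4):
    """Finite-band approximation of Equation (3.11)."""
    return (cdf(x + delta / 2) - cdf(x - delta / 2)) / delta

def integrate(f, lo, hi, n=4000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    step = (hi - lo) / n
    return sum(f(lo + (k + 0.5) * step) for k in range(n)) * step

pdf_at_zero = pdf_from_cdf(gaussian_cdf, 0.0)
area = integrate(lambda x: pdf_from_cdf(gaussian_cdf, x), -8.0, 8.0)
print(pdf_at_zero, area)
```

The value at zero should agree closely with the analytic Gaussian density $1/\sqrt{2\pi} \approx 0.3989$, and the area should be very close to one; the integration range of ±8 standard deviations stands in for ±∞.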

Now, the probability models for random variables can also be applied to
random processes. For a continuous-valued random process X(m), the
simplest probabilistic model is the univariate pdf $f_{X(m)}(x)$, which is the
probability density that a sample from the random process X(m) takes on a
value of x. A bivariate pdf $f_{X(m)X(m+n)}(x_1, x_2)$ describes the
probability density that the samples of the process at time instants m and
m+n take on the values $x_1$ and $x_2$ respectively. In general, an M-variate
pdf $f_{X(m_1)X(m_2)\cdots X(m_M)}(x_1, x_2, \ldots, x_M)$ describes the pdf
of M samples of a random process taking specific values at specific time
instants. For an M-variate pdf, we can write

$$\int_{-\infty}^{\infty} f_{X(m_1)\cdots X(m_M)}(x_1, \ldots, x_M)\,dx_M = f_{X(m_1)\cdots X(m_{M-1})}(x_1, \ldots, x_{M-1}) \qquad (3.13)$$

and the integral of the pdf over all possible realisations of a random
process is unity, i.e.

$$\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} f_{X(m_1)\cdots X(m_M)}(x_1, \ldots, x_M)\,dx_1 \cdots dx_M = 1 \qquad (3.14)$$



The probability of a realisation of a random process at a specified time
instant may be conditioned on the value of the process at some other time
instant, and expressed in the form of a conditional probability density
function as

$$f_{X(m)|X(n)}(x_m \mid x_n) = \frac{f_{X(n)|X(m)}(x_n \mid x_m)\,f_{X(m)}(x_m)}{f_{X(n)}(x_n)} \qquad (3.15)$$



If the outcome of a random process at any time is independent of its
outcomes at other time instants, then the random process is statistically
independent, and hence also uncorrelated. For such a process a multivariate
pdf can be written in terms of the products of univariate pdfs as

$$f_{[X(m_1)\cdots X(m_M) \mid X(n_1)\cdots X(n_N)]}(x_{m_1}, \ldots, x_{m_M} \mid x_{n_1}, \ldots, x_{n_N}) = \prod_{i=1}^{M} f_{X(m_i)}(x_{m_i}) \qquad (3.16)$$



Discrete-valued stochastic processes can only assume values from a finite
set of allowable numbers $\{x_1, x_2, \ldots, x_n\}$. An example is the output
of a binary message coder that generates a sequence of 1s and 0s.
Discrete-time, discrete-valued, stochastic processes are characterised by
multivariate probability mass functions (pmf) denoted as

$$P_{[X(m_1)\cdots X(m_M)]}(x_{m_1} = x_i, \ldots, x_{m_M} = x_k) \qquad (3.17)$$

The probability that a discrete random process X(m) takes on a value of $x_m$
at time instant m can be conditioned on the process taking on a value $x_n$ at
some other time instant n, and expressed in the form of a conditional pmf as

$$P_{X(m)|X(n)}(x_m \mid x_n) = \frac{P_{X(n)|X(m)}(x_n \mid x_m)\,P_{X(m)}(x_m)}{P_{X(n)}(x_n)} \qquad (3.18)$$

and for a statistically independent process we have

$$P_{[X(m_1)\cdots X(m_M) \mid X(n_1)\cdots X(n_N)]}(x_{m_1}, \ldots, x_{m_M} \mid x_{n_1}, \ldots, x_{n_N}) = \prod_{i=1}^{M} P_{X(m_i)}\big(X(m_i) = x_{m_i}\big) \qquad (3.19)$$



3.3 Stationary and Non-Stationary Random Processes 

Although the amplitude of a signal x(m) fluctuates with time m, the
characteristics of the process that generates the signal may be time-invariant
(stationary) or time-varying (non-stationary). An example of a
non-stationary process is speech, whose loudness and spectral composition
change continuously as the speaker generates various sounds. A process is
stationary if the parameters of the probability model of the process are
time-invariant; otherwise it is non-stationary (Figure 3.4). The stationarity
property implies that all the parameters, such as the mean, the variance, the
power spectral composition and the higher-order moments of the process,
are time-invariant. In practice, there are various degrees of stationarity: it
may be that one set of the statistics of a process is stationary, whereas
another set is time-varying. For example, a random process may have a
time-invariant mean, but a time-varying power.





Figure 3.4 Examples of a quasistationary and a non-stationary speech segment. 
Example 3.3 In this example, we consider the time-averaged values of the
mean and the power of: (a) a stationary signal $A\sin\omega t$ and (b) a transient
signal $Ae^{-\alpha t}$.

The mean and power of the sinusoid are

$$\mathrm{Mean}(A\sin\omega t) = \frac{1}{T}\int_{T} A\sin\omega t\,dt = 0, \quad \text{constant} \tag{3.20}$$

$$\mathrm{Power}(A\sin\omega t) = \frac{1}{T}\int_{T} A^2\sin^2\omega t\,dt = \frac{A^2}{2}, \quad \text{constant} \tag{3.21}$$

where T is the period of the sine wave. The mean and the power of the
transient signal are given by

$$\mathrm{Mean}(Ae^{-\alpha t}) = \frac{1}{T}\int_{t}^{t+T} Ae^{-\alpha\tau}\,d\tau = \frac{A}{\alpha T}\left(1-e^{-\alpha T}\right)e^{-\alpha t}, \quad \text{time-varying} \tag{3.22}$$

$$\mathrm{Power}(Ae^{-\alpha t}) = \frac{1}{T}\int_{t}^{t+T} A^2e^{-2\alpha\tau}\,d\tau = \frac{A^2}{2\alpha T}\left(1-e^{-2\alpha T}\right)e^{-2\alpha t}, \quad \text{time-varying} \tag{3.23}$$

In Equations (3.22) and (3.23), the signal mean and power are exponentially
decaying functions of the time variable t.
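The contrast in Example 3.3 can be checked numerically. The following is a minimal sketch, not part of the original text: the values of A, omega and alpha, the helper name `window_average` and the use of a midpoint rule are all assumptions made for illustration. It averages each signal over a window of one sinusoid period T starting at several times t.

```python
import math

# Illustrative parameter values (assumptions, not taken from the text).
A = 2.0                      # amplitude
omega = 2.0 * math.pi * 5.0  # angular frequency of the sinusoid (rad/s)
alpha = 3.0                  # decay rate of the transient (1/s)
T = 2.0 * math.pi / omega    # averaging window: one period of the sinusoid


def window_average(func, t_start, window, steps=20000):
    """Midpoint-rule average of func over [t_start, t_start + window]."""
    dt = window / steps
    total = sum(func(t_start + (i + 0.5) * dt) for i in range(steps))
    return total * dt / window


def sinusoid(t):
    return A * math.sin(omega * t)


def transient(t):
    return A * math.exp(-alpha * t)


def transient_mean_closed_form(t):
    # Windowed mean of A*exp(-alpha*t) over [t, t + T].
    return A / (alpha * T) * (1.0 - math.exp(-alpha * T)) * math.exp(-alpha * t)


def transient_power_closed_form(t):
    # Windowed power of A*exp(-alpha*t) over [t, t + T].
    return (A ** 2 / (2.0 * alpha * T)
            * (1.0 - math.exp(-2.0 * alpha * T)) * math.exp(-2.0 * alpha * t))


for t in (0.0, 0.1, 0.37):
    sin_mean = window_average(sinusoid, t, T)
    sin_power = window_average(lambda u: sinusoid(u) ** 2, t, T)
    tr_mean = window_average(transient, t, T)
    tr_power = window_average(lambda u: transient(u) ** 2, t, T)
    print(f"t={t:4.2f}  sinusoid: mean={sin_mean:+.4f} power={sin_power:.4f}  "
          f"transient: mean={tr_mean:.4f} power={tr_power:.4f}")
```

The sinusoid columns should stay at 0 and A^2/2 whatever the start time, whereas the transient columns shrink as t grows, which is the sense in which the second signal is non-stationary.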

Example 3.4 Consider a non-stationary signal y(m) generated by a binary-state
random process described by the following equation:

$$y(m) = \bar{s}(m)\,x_0(m) + s(m)\,x_1(m) \tag{3.24}$$

where s(m) is a binary-valued state indicator variable and $\bar{s}(m)$ denotes the
binary complement of s(m). From Equation (3.24), we have

$$y(m) = \begin{cases} x_0(m) & \text{if } s(m)=0 \\ x_1(m) & \text{if } s(m)=1 \end{cases} \tag{3.25}$$
Let $\mu_{x_0}$ and $P_{x_0}$ denote the mean and the power of the signal $x_0(m)$, and
$\mu_{x_1}$ and $P_{x_1}$ the mean and the power of $x_1(m)$ respectively. The expectation
of y(m), given the state s(m), is obtained as

$$\mathcal{E}[y(m)|s(m)] = \bar{s}(m)\,\mathcal{E}[x_0(m)] + s(m)\,\mathcal{E}[x_1(m)] = \bar{s}(m)\,\mu_{x_0} + s(m)\,\mu_{x_1} \tag{3.26}$$

In Equation (3.26), the mean of y(m) is expressed as a function of the state
of the process at time m. The power of y(m) is given by

$$\mathcal{E}[y^2(m)|s(m)] = \bar{s}(m)\,\mathcal{E}[x_0^2(m)] + s(m)\,\mathcal{E}[x_1^2(m)] = \bar{s}(m)\,P_{x_0} + s(m)\,P_{x_1} \tag{3.27}$$
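The state-dependent mean and power of Example 3.4 can be illustrated with a small simulation. This is a sketch under stated assumptions: the two component signals are taken to be Gaussian with the (hypothetical) parameters below, the state sequence is an independent fair coin, and the helper `conditional_stats` is introduced only for this illustration.

```python
import random

rng = random.Random(0)      # fixed seed so the run is repeatable
N = 40000

# Assumed parameters of the two component processes x0(m) and x1(m).
mu0, sigma0 = 0.0, 1.0
mu1, sigma1 = 3.0, 2.0

s = [rng.randint(0, 1) for _ in range(N)]            # binary state indicator s(m)
x0 = [rng.gauss(mu0, sigma0) for _ in range(N)]
x1 = [rng.gauss(mu1, sigma1) for _ in range(N)]

# y(m) = (1 - s(m)) * x0(m) + s(m) * x1(m), i.e. Equation (3.24).
y = [(1 - sm) * a + sm * b for sm, a, b in zip(s, x0, x1)]


def conditional_stats(signal, states, state):
    """Sample mean and power of the signal over the samples in the given state."""
    selected = [v for v, sm in zip(signal, states) if sm == state]
    mean = sum(selected) / len(selected)
    power = sum(v * v for v in selected) / len(selected)
    return mean, power


mean0, power0 = conditional_stats(y, s, 0)
mean1, power1 = conditional_stats(y, s, 1)
print("state 0: mean %.3f power %.3f" % (mean0, power0))
print("state 1: mean %.3f power %.3f" % (mean1, power1))
```

With these assumed parameters the state-0 statistics should be close to 0 and 1, and the state-1 statistics close to 3 and 13 (the power being the variance plus the squared mean).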



Although many signals are non-stationary, the concept of a stationary 
process has played an important role in the development of signal 
processing methods. Furthermore, even non-stationary signals such as 
speech can often be considered as approximately stationary for a short 
period of time. In signal processing theory, two classes of stationary 
processes are defined: (a) strict-sense stationary processes and (b) wide- 
sense stationary processes, which is a less strict form of stationarity, in that 
it only requires that the first-order and second-order statistics of the process 
should be time-invariant. 



3.3.1 Strict-Sense Stationary Processes 

A random process X(m) is stationary in a strict sense if all its distributions
and statistical parameters are time-invariant. Strict-sense stationarity implies
that the nth-order distribution is translation-invariant for all n = 1, 2, 3, ...:

$$\mathrm{Prob}[x(m_1)\le x_1, x(m_2)\le x_2, \ldots, x(m_n)\le x_n] = \mathrm{Prob}[x(m_1+\tau)\le x_1, x(m_2+\tau)\le x_2, \ldots, x(m_n+\tau)\le x_n] \tag{3.28}$$

From Equation (3.28) the statistics of a strict-sense stationary process,
including the mean, the correlation and the power spectrum, are
time-invariant; therefore we have

$$\mathcal{E}[x(m)] = \mu_x \tag{3.29}$$
$$\mathcal{E}[x(m)x(m+k)] = r_{xx}(k) \tag{3.30}$$

and

$$\mathcal{E}[|X(f,m)|^2] = \mathcal{E}[|X(f)|^2] = P_{XX}(f) \tag{3.31}$$

where $\mu_x$, $r_{xx}(k)$ and $P_{XX}(f)$ are the mean value, the autocorrelation and the
power spectrum of the signal x(m) respectively, and X(f,m) denotes the
frequency-time spectrum of x(m).



3.3.2 Wide-Sense Stationary Processes 

The strict-sense stationarity condition requires that all statistics of the
process should be time-invariant. A less restrictive form of a stationary
process is so-called wide-sense stationarity. A process is said to be
wide-sense stationary if the mean and the autocorrelation functions of the
process are time-invariant:

$$\mathcal{E}[x(m)] = \mu_x \tag{3.32}$$

$$\mathcal{E}[x(m)x(m+k)] = r_{xx}(k) \tag{3.33}$$



From the definitions of strict-sense and wide-sense stationary processes, it is 
clear that a strict-sense stationary process is also wide-sense stationary, 
whereas the reverse is not necessarily true. 
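As an illustration of the two wide-sense conditions, consider a sinusoid with a random phase that is uniformly distributed over one cycle. This process is an assumed example, not one taken from the text. The sketch below replaces the ensemble average by an average over K equally spaced phases (which is exact for this signal, up to rounding) and shows that the mean and the lag-k correlation do not depend on the time index m.

```python
import math

A, omega = 1.5, 0.3   # assumed amplitude and angular frequency (radians per sample)
K = 360               # number of equally spaced phases standing in for the ensemble
phases = [2.0 * math.pi * j / K for j in range(K)]


def x(m, phi):
    """One realisation of the random-phase sinusoid, evaluated at time m."""
    return A * math.cos(omega * m + phi)


def ensemble_mean(m):
    return sum(x(m, phi) for phi in phases) / K


def ensemble_autocorr(m, k):
    """Ensemble average of x(m) * x(m + k)."""
    return sum(x(m, phi) * x(m + k, phi) for phi in phases) / K


lag = 5
expected = 0.5 * A * A * math.cos(omega * lag)   # (A^2 / 2) cos(omega * k)
for m in (0, 7, 100):
    print(m, round(ensemble_mean(m), 12), round(ensemble_autocorr(m, lag), 12),
          round(expected, 12))
```

Every row should show a mean of zero and the same correlation value, so both wide-sense conditions hold for this example even though each individual realisation is a deterministic waveform.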



3.3.3 Non-Stationary Processes 

A random process is non-stationary if its distributions or statistics vary with
time. Most stochastic processes such as video signals, audio signals,
financial data, meteorological data, biomedical signals, etc., are
non-stationary, because they are generated by systems whose environments and
parameters vary over time. For example, speech is a non-stationary process
generated by a time-varying articulatory system. The loudness and the
frequency composition of speech change over time, and sometimes the
change can be quite abrupt. Time-varying processes may be modelled by a
combination of stationary random models as illustrated in Figure 3.5. In
Figure 3.5(a) a non-stationary process is modelled as the output of a
time-varying system whose parameters are controlled by a stationary process. In
Figure 3.5(b) a time-varying process is modelled by a chain of
time-invariant states, with each state having a different set of statistics or
probability distributions. Finite state statistical models for time-varying
processes are discussed in detail in Chapter 5.

Figure 3.5 Two models for non-stationary processes: (a) a stationary process
drives the parameters of a continuously time-varying model; (b) a finite-state
model with each state having a different set of statistics.



3.4 Expected Values of a Random Process 

Expected values of a process play a central role in the modelling and
processing of signals. Furthermore, the probability models of a random
process are usually expressed as functions of the expected values. For
example, a Gaussian pdf is defined as an exponential function of the mean
and the covariance of the process, and a Poisson pdf is defined in terms of
the mean of the process. In signal processing applications, we often have a
suitable statistical model of the process, e.g. a Gaussian pdf, and to complete
the model we need the values of the expected parameters. Furthermore, in
many signal processing algorithms, such as spectral subtraction for noise
reduction described in Chapter 11, or linear prediction described in Chapter
8, what we essentially need is an estimate of the mean or the correlation
function of the process. The expected value of a function,
$h(X(m_1), X(m_2), \ldots, X(m_M))$, of a random process X is defined as

$$\mathcal{E}[h(X(m_1),\ldots,X(m_M))] = \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} h(x_1,\ldots,x_M)\, f_{X(m_1)\cdots X(m_M)}(x_1,\ldots,x_M)\,dx_1\cdots dx_M \tag{3.34}$$

The most important, and widely used, expected values are the mean value, 
the correlation, the covariance, and the power spectrum. 
3.4.1 The Mean Value

The mean value of a process plays an important part in signal processing
and parameter estimation from noisy observations. For example, in Chapter
3 it is shown that the optimal linear estimate of a signal from a noisy
observation is an interpolation between the mean value and the observed
value of the noisy signal. The mean value of a random vector
$[X(m_1), \ldots, X(m_M)]$ is its average value across the ensemble of the process
defined as

$$\mathcal{E}[X(m_1),\ldots,X(m_M)] = \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} (x_1,\ldots,x_M)\, f_{X(m_1)\cdots X(m_M)}(x_1,\ldots,x_M)\,dx_1\cdots dx_M \tag{3.35}$$



3.4.2 Autocorrelation 

The correlation function and its Fourier transform, the power spectral 
density, are used in modelling and identification of patterns and structures in 
a signal process. Correlators play a central role in signal processing and 
telecommunication systems, including predictive coders, equalisers, digital 
decoders, delay estimators, classifiers and signal restoration systems. The 
autocorrelation function of a random process X(m), denoted by $r_{xx}(m_1,m_2)$, is
defined as

$$r_{xx}(m_1,m_2) = \mathcal{E}[x(m_1)x(m_2)] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} x(m_1)\,x(m_2)\, f_{X(m_1),X(m_2)}\big(x(m_1),x(m_2)\big)\,dx(m_1)\,dx(m_2) \tag{3.36}$$

The autocorrelation function $r_{xx}(m_1,m_2)$ is a measure of the similarity, or the
mutual relation, of the outcomes of the process X at time instants $m_1$ and $m_2$.
If the outcome of a random process at time $m_1$ bears no relation to that at
time $m_2$ then $X(m_1)$ and $X(m_2)$ are said to be independent or uncorrelated
and, for a zero-mean process, $r_{xx}(m_1,m_2)=0$. For a wide-sense stationary
process, the autocorrelation function is time-invariant and depends on the
time difference $m = m_1 - m_2$:

$$r_{xx}(m_1+\tau, m_2+\tau) = r_{xx}(m_1,m_2) = r_{xx}(m_1-m_2) = r_{xx}(m) \tag{3.37}$$
The autocorrelation function of a real-valued wide-sense stationary process
is a symmetric function with the following properties:

$$r_{xx}(-m) = r_{xx}(m) \tag{3.38}$$

$$r_{xx}(m) \le r_{xx}(0) \tag{3.39}$$

Note that for a zero-mean signal, $r_{xx}(0)$ is the signal power.

Example 3.5 Autocorrelation of the output of a linear time-invariant (LTI)
system. Let x(m), y(m) and h(m) denote the input, the output and the impulse
response of an LTI system respectively. The input-output relation is given by

$$y(m) = \sum_{k} h_k\,x(m-k) \tag{3.40}$$

The autocorrelation function of the output signal y(m) can be related to the
autocorrelation of the input signal x(m) by

$$r_{yy}(k) = \mathcal{E}[y(m)y(m+k)] = \sum_{i}\sum_{j} h_i h_j\,\mathcal{E}[x(m-i)x(m+k-j)] = \sum_{i}\sum_{j} h_i h_j\,r_{xx}(k+i-j) \tag{3.41}$$

When the input x(m) is a zero-mean uncorrelated random signal with a unit
variance, Equation (3.41) becomes

$$r_{yy}(k) = \sum_{i} h_i h_{k+i} \tag{3.42}$$
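The reduction from Equation (3.41) to Equation (3.42) can be verified directly for a short filter. The sketch below is illustrative only: the three-tap impulse response and the function names are assumptions, and the input autocorrelation is taken to be a unit impulse at lag zero (zero-mean, unit-variance, uncorrelated input).

```python
def output_autocorr(h, r_xx, k):
    """Equation (3.41): r_yy(k) = sum_i sum_j h_i h_j r_xx(k + i - j)."""
    taps = range(len(h))
    return sum(h[i] * h[j] * r_xx(k + i - j) for i in taps for j in taps)


def output_autocorr_white(h, k):
    """Equation (3.42): r_yy(k) = sum_i h_i h_{k+i} for a unit-variance white input."""
    k = abs(k)                      # r_yy is symmetric in the lag
    return sum(h[i] * h[i + k] for i in range(len(h) - k))


def white_autocorr(lag):
    """Autocorrelation of zero-mean, unit-variance, uncorrelated input samples."""
    return 1.0 if lag == 0 else 0.0


h = [1.0, 0.5, 0.25]                # assumed impulse response, for illustration only
for k in range(-3, 4):
    print(k, output_autocorr(h, white_autocorr, k), output_autocorr_white(h, k))
```

Both columns agree at every lag, and the output autocorrelation is non-zero only for lags shorter than the filter length, which is how filtering introduces correlation into an uncorrelated input.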



3.4.3 Autocovariance 

The autocovariance function $c_{xx}(m_1,m_2)$ of a random process X(m) is a measure
of the scatter, or the dispersion, of the random process about the mean value,
and is defined as

$$c_{xx}(m_1,m_2) = \mathcal{E}\big[(x(m_1)-\mu_x(m_1))(x(m_2)-\mu_x(m_2))\big] = r_{xx}(m_1,m_2) - \mu_x(m_1)\,\mu_x(m_2) \tag{3.43}$$
where $\mu_x(m)$ is the mean of X(m). Note that for a zero-mean process the
autocorrelation and the autocovariance functions are identical. Note also that
$c_{xx}(m_1,m_1)$ is the variance of the process. For a stationary process the
autocovariance function of Equation (3.43) becomes

$$c_{xx}(m_1,m_2) = c_{xx}(m_1-m_2) = r_{xx}(m_1-m_2) - \mu_x^2 \tag{3.44}$$



3.4.4 Power Spectral Density 

The power spectral density (PSD) function, also called the power spectrum,
of a random process gives the spectrum of the distribution of the power
among the individual frequency contents of the process. The power
spectrum of a wide-sense stationary process X(m) is defined, by the
Wiener-Khinchin theorem in Chapter 9, as the Fourier transform of the
autocorrelation function:

$$P_{XX}(f) = \mathcal{E}[X(f)X^*(f)] = \sum_{m=-\infty}^{\infty} r_{xx}(m)\,e^{-j2\pi fm} \tag{3.45}$$

where $r_{xx}(m)$ and $P_{XX}(f)$ are the autocorrelation and power spectrum of x(m)
respectively, and f is the frequency variable. For a real-valued stationary
process, the autocorrelation is symmetric, and the power spectrum may be
written as

$$P_{XX}(f) = r_{xx}(0) + \sum_{m=1}^{\infty} 2\,r_{xx}(m)\cos(2\pi fm) \tag{3.46}$$

The power spectral density is a real-valued non-negative function, expressed
in units of watts per hertz. From Equation (3.45), the autocorrelation
sequence of a random process may be obtained as the inverse Fourier
transform of the power spectrum as

$$r_{xx}(m) = \int_{-1/2}^{1/2} P_{XX}(f)\,e^{j2\pi fm}\,df \tag{3.47}$$

Note that the autocorrelation and the power spectrum represent the second
order statistics of a process in the time and frequency domains respectively.
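The Fourier pair formed by Equations (3.46) and (3.47) can be exercised on a simple case. The sketch below assumes, purely for illustration, an exponentially decaying autocorrelation with parameter a = 0.6, truncates the sum in Equation (3.46) at 100 lags, and inverts the spectrum with a midpoint rule; none of these choices comes from the text.

```python
import math

a = 0.6   # assumed decay parameter of the example autocorrelation


def r_xx(m):
    """Example autocorrelation sequence, a^|m| (an assumption for illustration)."""
    return a ** abs(m)


def psd_from_autocorr(r, f, max_lag):
    """Equation (3.46), with the infinite sum truncated at max_lag."""
    return r(0) + sum(2.0 * r(m) * math.cos(2.0 * math.pi * f * m)
                      for m in range(1, max_lag + 1))


def p_xx(f):
    return psd_from_autocorr(r_xx, f, max_lag=100)


def autocorr_from_psd(psd, m, points=512):
    """Equation (3.47) by the midpoint rule; the sine part integrates to zero
    because the spectrum of a real-valued process is even in f."""
    df = 1.0 / points
    total = 0.0
    for i in range(points):
        f = -0.5 + (i + 0.5) * df
        total += psd(f) * math.cos(2.0 * math.pi * f * m)
    return total * df


for f in (0.0, 0.1, 0.25, 0.5):
    print("P_XX(%.2f) = %.6f" % (f, p_xx(f)))
for m in (0, 1, 3):
    print("r_xx(%d): original %.6f, recovered %.6f"
          % (m, r_xx(m), autocorr_from_psd(p_xx, m)))
```

For this example the spectrum stays positive at every frequency and the recovered lags match the original sequence, consistent with the statement that the PSD is real-valued and non-negative.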
Figure 3.6 Autocorrelation and power spectrum of white noise.



Example 3.6 Power spectrum and autocorrelation of white noise
(Figure 3.6). A noise process with uncorrelated independent samples is
called a white noise process. The autocorrelation of a stationary white noise
n(m) is defined as:

$$r_{nn}(k) = \mathcal{E}[n(m)n(m+k)] = \begin{cases} \text{Noise power} & k = 0 \\ 0 & k \ne 0 \end{cases} \tag{3.48}$$

Equation (3.48) is a mathematical statement of the definition of an
uncorrelated white noise process. The equivalent description in the
frequency domain is derived by taking the Fourier transform of $r_{nn}(k)$:

$$P_{NN}(f) = \sum_{k=-\infty}^{\infty} r_{nn}(k)\,e^{-j2\pi fk} = r_{nn}(0) = \text{noise power} \tag{3.49}$$

The power spectrum of a stationary white noise process is spread equally 
across all time instances and across all frequency bins. White noise is one of 
the most difficult types of noise to remove, because it does not have a 
localised structure either in the time domain or in the frequency domain. 

Example 3.7 Autocorrelation and power spectrum of impulsive noise.
Impulsive noise is a random, binary-state (“on/off”) sequence of impulses of
random amplitudes and random time of occurrence. In Chapter 12, a random
impulsive noise sequence $n_i(m)$ is modelled as an amplitude-modulated
random binary sequence as

$$n_i(m) = n(m)\,b(m) \tag{3.50}$$

where b(m) is a binary-state random sequence that indicates the presence or
the absence of an impulse, and n(m) is a random noise process. Assuming
that impulsive noise is an uncorrelated process, the autocorrelation of
impulsive noise can be defined as a binary-state process as

$$r_{nn}(k,m) = \mathcal{E}[n_i(m)n_i(m+k)] = \sigma_n^2\,\delta(k)\,b(m) \tag{3.51}$$

where $\sigma_n^2$ is the noise variance and $\delta(k)$ is the Kronecker delta function.
Note that in Equation (3.51), the autocorrelation is expressed as a
binary-state function that depends on the on/off state of impulsive noise at
time m. The power spectrum of an impulsive noise sequence is obtained by
taking the Fourier transform of the autocorrelation function:

$$P_{NN}(f,m) = \sigma_n^2\,b(m) \tag{3.52}$$



3.4.5 Joint Statistical Averages of Two Random Processes 

In many signal processing problems, for example in processing the outputs
of an array of sensors, we deal with more than one random process. Joint
statistics and joint distributions are used to describe the statistical
inter-relationship between two or more random processes. For two discrete-time
random processes x(m) and y(m), the joint pdf is denoted by

$$f_{X(m_1)\cdots X(m_M),\,Y(n_1)\cdots Y(n_N)}(x_1,\ldots,x_M,\,y_1,\ldots,y_N) \tag{3.53}$$

When two random processes X(m) and Y(m) are statistically independent, the
joint pdf can be expressed as the product of the pdfs of each process as

$$f_{X(m_1)\cdots X(m_M),\,Y(n_1)\cdots Y(n_N)}(x_1,\ldots,x_M,\,y_1,\ldots,y_N) = f_{X(m_1)\cdots X(m_M)}(x_1,\ldots,x_M)\, f_{Y(n_1)\cdots Y(n_N)}(y_1,\ldots,y_N) \tag{3.54}$$

3.4.6 Cross-Correlation and Cross-Covariance 



The cross-correlation of two random processes x(m) and y(m) is defined as

$$r_{xy}(m_1,m_2) = \mathcal{E}[x(m_1)y(m_2)] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} x(m_1)\,y(m_2)\, f_{X(m_1)Y(m_2)}\big(x(m_1),y(m_2)\big)\,dx(m_1)\,dy(m_2) \tag{3.55}$$
For wide-sense stationary processes, the cross-correlation function
$r_{xy}(m_1,m_2)$ depends only on the time difference $m = m_1 - m_2$:

$$r_{xy}(m_1+\tau, m_2+\tau) = r_{xy}(m_1,m_2) = r_{xy}(m_1-m_2) = r_{xy}(m) \tag{3.56}$$

The cross-covariance function is defined as

$$c_{xy}(m_1,m_2) = \mathcal{E}\big[(x(m_1)-\mu_x(m_1))(y(m_2)-\mu_y(m_2))\big] = r_{xy}(m_1,m_2) - \mu_x(m_1)\,\mu_y(m_2) \tag{3.57}$$

Note that for zero-mean processes, the cross-correlation and the
cross-covariance functions are identical. For a wide-sense stationary process
the cross-covariance function of Equation (3.57) becomes

$$c_{xy}(m_1,m_2) = c_{xy}(m_1-m_2) = r_{xy}(m_1-m_2) - \mu_x\mu_y \tag{3.58}$$

Example 3.8 Time-delay estimation. Consider two signals $y_1(m)$ and
$y_2(m)$, each composed of an information bearing signal x(m) and an additive
noise, given by

$$y_1(m) = x(m) + n_1(m) \tag{3.59}$$

$$y_2(m) = A\,x(m-D) + n_2(m) \tag{3.60}$$

where A is an amplitude factor and D is a time delay variable. The
cross-correlation of the signals $y_1(m)$ and $y_2(m)$ yields

$$r_{y_1y_2}(k) = \mathcal{E}[y_1(m)\,y_2(m+k)] = \mathcal{E}\{[x(m)+n_1(m)][A\,x(m-D+k)+n_2(m+k)]\} = A\,r_{xx}(k-D) + r_{xn_2}(k) + A\,r_{n_1x}(k-D) + r_{n_1n_2}(k) \tag{3.61}$$

Assuming that the signal and the two noise processes are zero-mean and
mutually uncorrelated, we have $r_{y_1y_2}(k) = A\,r_{xx}(k-D)$. As shown in
Figure 3.7, the cross-correlation function has its maximum at the lag D.

Figure 3.7 The peak of the cross-correlation of two delayed signals can be used to
estimate the time delay D.
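A small simulation illustrates how the cross-correlation peak recovers the delay in Example 3.8. This is a sketch under assumptions that are not in the text: the signal is a random sequence of +1 and -1 values, the noises are Gaussian, the numbers N, D and A are arbitrary, and the ensemble average is replaced by a time average over one realisation.

```python
import random

rng = random.Random(1)       # fixed seed so the run is repeatable
N, D, A = 2000, 7, 0.8       # assumed record length, true delay and amplitude factor

x = [rng.choice((-1.0, 1.0)) for _ in range(N)]     # information-bearing signal
n1 = [rng.gauss(0.0, 0.5) for _ in range(N)]        # additive noise on sensor 1
n2 = [rng.gauss(0.0, 0.5) for _ in range(N)]        # additive noise on sensor 2

y1 = [x[m] + n1[m] for m in range(N)]
# Delayed, scaled copy of x plus noise; samples before the delay see no signal.
y2 = [(A * x[m - D] if m >= D else 0.0) + n2[m] for m in range(N)]


def cross_corr(u, v, k):
    """Time-averaged estimate of E[u(m) v(m + k)] for a lag k >= 0."""
    terms = [u[m] * v[m + k] for m in range(len(u) - k)]
    return sum(terms) / len(terms)


lags = range(0, 30)
r = [cross_corr(y1, y2, k) for k in lags]
estimated_delay = max(lags, key=lambda k: r[k])
print("estimated delay:", estimated_delay, " peak value: %.3f" % r[estimated_delay])
```

The peak value should be close to A times the signal power (here the power is 1), while the other lags hover around zero at a level set by the noise and the record length.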



3.4.7 Cross-Power Spectral Density and Coherence 

The cross-power spectral density of two random processes X(m) and Y(m) is
defined as the Fourier transform of their cross-correlation function:

$$P_{XY}(f) = \mathcal{E}[X(f)Y^*(f)] = \sum_{m=-\infty}^{\infty} r_{xy}(m)\,e^{-j2\pi fm} \tag{3.62}$$

Like the cross-correlation, the cross-power spectral density of two processes
is a measure of the similarity, or coherence, of their power spectra. The
coherence, or spectral coherence, of two random processes is a normalised
form of the cross-power spectral density, defined as

$$C_{XY}(f) = \frac{P_{XY}(f)}{\sqrt{P_{XX}(f)\,P_{YY}(f)}} \tag{3.63}$$


The coherence function is used in applications such as time-delay estimation 
and signal-to-noise ratio measurements. 



3.4.8 Ergodic Processes and Time-Averaged Statistics 

In many signal processing problems, there is only a single realisation of a 
random process from which its statistical parameters, such as the mean, the 
correlation and the power spectrum can be estimated. In such cases, time- 
averaged statistics, obtained from averages along the time dimension of a 
single realisation of the process, are used instead of the “true” ensemble 
averages obtained across the space of different realisations of the process. 
This section considers ergodic random processes for which time-averages
can be used instead of ensemble averages. A stationary stochastic process is
said to be ergodic if it exhibits the same statistical characteristics along the
time dimension of a single realisation as across the space (or ensemble) of
different realisations of the process. Over a very long time, a single
realisation of an ergodic process takes on all the values, the characteristics
and the configurations exhibited across the entire space of the process. For
an ergodic process {x(m,s)}, we have

$$\underset{\text{along time } m}{\text{statistical averages}}[x(m,s)] = \underset{\text{across space } s}{\text{statistical averages}}[x(m,s)] \tag{3.64}$$

where the statistical averages[.] function refers to any statistical operation
such as the mean, the variance, the power spectrum, etc.



3.4.9 Mean-Ergodic Processes 

The time-averaged estimate of the mean of a signal x(m) obtained from N
samples is given by

$$\hat{\mu}_x = \frac{1}{N}\sum_{m=0}^{N-1} x(m) \tag{3.65}$$

A stationary process is said to be mean-ergodic if the time-averaged value of
an infinitely long realisation of the process is the same as the
ensemble-mean taken across the space of the process. Therefore, for a
mean-ergodic process, we have

$$\lim_{N\to\infty} \mathcal{E}[\hat{\mu}_x] = \mu_x \tag{3.66}$$

$$\lim_{N\to\infty} \mathrm{var}[\hat{\mu}_x] = 0 \tag{3.67}$$

where $\mu_x$ is the “true” ensemble average of the process. Condition (3.67) is
also referred to as mean-ergodicity in the mean square error (or minimum
variance of error) sense. The time-averaged estimate of the mean of a signal,
obtained from a random realisation of the process, is itself a random
variable, with its own mean, variance and probability density function. If the
number of observation samples N is relatively large then, from the central
limit theorem, the probability density function of the estimate $\hat{\mu}_x$ is
approximately Gaussian. The expectation of $\hat{\mu}_x$ is given by
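$$\mathcal{E}[\hat{\mu}_x] = \mathcal{E}\left[\frac{1}{N}\sum_{m=0}^{N-1} x(m)\right] = \frac{1}{N}\sum_{m=0}^{N-1}\mathcal{E}[x(m)] = \frac{1}{N}\sum_{m=0}^{N-1}\mu_x = \mu_x \tag{3.68}$$

From Equation (3.68), the time-averaged estimate of the mean is unbiased.
The variance of $\hat{\mu}_x$ is given by

$$\mathrm{Var}[\hat{\mu}_x] = \mathcal{E}[\hat{\mu}_x^2] - \mathcal{E}^2[\hat{\mu}_x] = \mathcal{E}[\hat{\mu}_x^2] - \mu_x^2 \tag{3.69}$$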



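Now the term $\mathcal{E}[\hat{\mu}_x^2]$ in Equation (3.69) may be expressed as

$$\mathcal{E}[\hat{\mu}_x^2] = \mathcal{E}\left[\left(\frac{1}{N}\sum_{m=0}^{N-1} x(m)\right)\left(\frac{1}{N}\sum_{k=0}^{N-1} x(k)\right)\right] = \frac{1}{N}\sum_{m=-(N-1)}^{N-1}\left(1-\frac{|m|}{N}\right) r_{xx}(m) \tag{3.70}$$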



Substitution of Equation (3.70) in Equation (3.69) yields

$$\mathrm{Var}[\hat{\mu}_x] = \frac{1}{N}\sum_{m=-(N-1)}^{N-1}\left(1-\frac{|m|}{N}\right) r_{xx}(m) - \mu_x^2 = \frac{1}{N}\sum_{m=-(N-1)}^{N-1}\left(1-\frac{|m|}{N}\right) c_{xx}(m) \tag{3.71}$$

Therefore the condition for a process to be mean-ergodic, in the mean
square error sense, is

$$\lim_{N\to\infty} \frac{1}{N}\sum_{m=-(N-1)}^{N-1}\left(1-\frac{|m|}{N}\right) c_{xx}(m) = 0 \tag{3.72}$$
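The mean-ergodicity condition of Equation (3.72) is easy to evaluate for a given autocovariance. The sketch below is illustrative: the three autocovariance functions are assumed examples (white noise, an exponentially decaying covariance, and a constant covariance as produced by a process whose every realisation is a random constant), and the function names are hypothetical.

```python
def mean_estimate_variance(c_xx, n):
    """Equation (3.71): (1/N) * sum over m of (1 - |m|/N) * c_xx(m)."""
    return sum((1.0 - abs(m) / n) * c_xx(m) for m in range(-(n - 1), n)) / n


def white_cov(m, variance=2.0):
    """Uncorrelated samples: covariance is non-zero only at lag 0."""
    return variance if m == 0 else 0.0


def decaying_cov(m, rho=0.9):
    """Correlated samples whose covariance decays geometrically with lag."""
    return rho ** abs(m)


def constant_cov(m):
    """Covariance that never decays, e.g. x(m) equal to one random constant."""
    return 1.0


for n in (10, 100, 1000):
    print(n,
          round(mean_estimate_variance(white_cov, n), 6),
          round(mean_estimate_variance(decaying_cov, n), 6),
          round(mean_estimate_variance(constant_cov, n), 6))
```

The first two columns shrink towards zero as N grows (the white-noise case as variance/N), so those processes are mean-ergodic; the last column stays at 1 for every N, so time-averaging a single realisation of that process never converges to the ensemble mean.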



3.4.10 Correlation-Ergodic Processes 

The time-averaged estimate of the autocorrelation of a random process, 
estimated from N samples of a realisation of the process, is given by 
$$\hat{r}_{xx}(m) = \frac{1}{N}\sum_{k=0}^{N-1} x(k)\,x(k+m) \tag{3.73}$$

A process is correlation-ergodic, in the mean square error sense, if

$$\lim_{N\to\infty} \mathcal{E}[\hat{r}_{xx}(m)] = r_{xx}(m) \tag{3.74}$$

$$\lim_{N\to\infty} \mathrm{Var}[\hat{r}_{xx}(m)] = 0 \tag{3.75}$$

where $r_{xx}(m)$ is the ensemble-averaged autocorrelation. Taking the
expectation of $\hat{r}_{xx}(m)$ shows that it is an unbiased estimate, since

$$\mathcal{E}[\hat{r}_{xx}(m)] = \mathcal{E}\left[\frac{1}{N}\sum_{k=0}^{N-1} x(k)\,x(k+m)\right] = \frac{1}{N}\sum_{k=0}^{N-1}\mathcal{E}[x(k)\,x(k+m)] = r_{xx}(m) \tag{3.76}$$

The variance of $\hat{r}_{xx}(m)$ is given by

$$\mathrm{Var}[\hat{r}_{xx}(m)] = \mathcal{E}[\hat{r}_{xx}^2(m)] - r_{xx}^2(m) \tag{3.77}$$



The term $\mathcal{E}[\hat{r}_{xx}^2(m)]$ in Equation (3.77) may be expressed as

$$\mathcal{E}[\hat{r}_{xx}^2(m)] = \frac{1}{N^2}\sum_{k=0}^{N-1}\sum_{j=0}^{N-1}\mathcal{E}[x(k)x(k+m)x(j)x(j+m)] = \frac{1}{N^2}\sum_{k=0}^{N-1}\sum_{j=0}^{N-1}\mathcal{E}[z(k,m)z(j,m)] = \frac{1}{N}\sum_{k=-N+1}^{N-1}\left(1-\frac{|k|}{N}\right) r_{zz}(k,m) \tag{3.78}$$

where z(i,m)=x(i)x(i+m). Therefore the condition for correlation ergodicity
in the mean square error sense is given by

$$\lim_{N\to\infty}\left[\frac{1}{N}\sum_{k=-N+1}^{N-1}\left(1-\frac{|k|}{N}\right) r_{zz}(k,m) - r_{xx}^2(m)\right] = 0 \tag{3.79}$$
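The time-averaged autocorrelation estimate of Equation (3.73) can be written in a few lines. The sketch below is an illustration only: the function name is hypothetical, the lag m is assumed non-negative, and the test signal is seeded Gaussian white noise chosen because its true autocorrelation is known (1 at lag 0 and 0 elsewhere).

```python
import random


def autocorr_estimate(x, m, n):
    """Equation (3.73): (1/N) * sum_{k=0}^{N-1} x(k) x(k + m).

    The sum runs over N products, so N + m samples of x are required.
    """
    if n + m > len(x):
        raise ValueError("need at least N + m samples")
    return sum(x[k] * x[k + m] for k in range(n)) / n


# Tiny hand-checkable case: (1*2 + 2*3 + 3*4) / 3 = 20/3.
small = autocorr_estimate([1.0, 2.0, 3.0, 4.0], m=1, n=3)
print("small example:", small)

# Unit-variance white noise: estimates should approach 1 at lag 0, 0 elsewhere.
rng = random.Random(2)
N, max_lag = 20000, 5
noise = [rng.gauss(0.0, 1.0) for _ in range(N + max_lag)]
estimates = [autocorr_estimate(noise, m, N) for m in range(max_lag + 1)]
for m, value in enumerate(estimates):
    print("lag %d: %.4f" % (m, value))
```

Because every lag uses the same number of products N, the estimate is unbiased, as Equation (3.76) states; the price is that N + m samples must be available for lag m.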
3.5 Some Useful Classes of Random Processes

In this section, we consider some important classes of random processes 
extensively used in signal processing applications for the modelling of 
signals and noise. 



3.5.1 Gaussian (Normal) Process 

The Gaussian process, also called the normal process, is perhaps the most 
widely applied of all probability models. Some advantages of Gaussian 
probability models are the following: 

(a) Gaussian pdfs can model the distribution of many processes 
including some important classes of signals and noise. 

(b) Non-Gaussian processes can be approximated by a weighted 
combination (i.e. a mixture) of a number of Gaussian pdfs of 
appropriate means and variances. 

(c) Optimal estimation methods based on Gaussian models often result 
in linear and mathematically tractable solutions. 

(d) The sum of many independent random processes tends towards a
Gaussian distribution. This is known as the central limit theorem.

A scalar Gaussian random variable is described by the following probability
density function:

$$f_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma_x}\exp\left(-\frac{(x-\mu_x)^2}{2\sigma_x^2}\right) \tag{3.80}$$

where $\mu_x$ and $\sigma_x^2$ are the mean and the variance of the random variable x.
The Gaussian process of Equation (3.80) is also denoted by $\mathcal{N}(x, \mu_x, \sigma_x^2)$.
The maximum of a Gaussian pdf occurs at the mean $\mu_x$, and is given by

$$f_X(x=\mu_x) = \frac{1}{\sqrt{2\pi}\,\sigma_x} \tag{3.81}$$
Figure 3.8 Gaussian probability density and cumulative distribution functions.

From Equation (3.80), the Gaussian pdf of x decreases exponentially with
the increasing distance of x from the mean value $\mu_x$. The distribution
function F(x) is given by

$$F_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma_x}\int_{-\infty}^{x}\exp\left(-\frac{(\chi-\mu_x)^2}{2\sigma_x^2}\right)d\chi \tag{3.82}$$



Figure 3.8 shows the pdf and the cdf of a Gaussian model. 
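The scalar Gaussian pdf and its distribution function can be evaluated with the Python standard library. This is a minimal sketch: the mean and standard deviation are assumed values, and the distribution function is written in terms of the error function `math.erf`, which is an equivalent form of the integral in Equation (3.82).

```python
import math


def gaussian_pdf(x, mean, std):
    """Equation (3.80): scalar Gaussian probability density."""
    return (math.exp(-(x - mean) ** 2 / (2.0 * std ** 2))
            / (math.sqrt(2.0 * math.pi) * std))


def gaussian_cdf(x, mean, std):
    """Equation (3.82), expressed through the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


mean, std = 1.0, 2.0   # assumed values for illustration
peak = gaussian_pdf(mean, mean, std)   # Equation (3.81): 1 / (sqrt(2*pi) * std)
print("peak of pdf: %.6f" % peak)
for x in (-3.0, 1.0, 3.0, 5.0):
    print("x=%5.1f  pdf=%.6f  cdf=%.6f"
          % (x, gaussian_pdf(x, mean, std), gaussian_cdf(x, mean, std)))
```

The pdf is largest at the mean and symmetric about it, and the distribution function passes through 0.5 at the mean.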



3.5.2 Multivariate Gaussian Process 

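Multivariate densities model vector-valued processes. Consider a P-variate
Gaussian vector process $\{\boldsymbol{x}=[x(m_0), x(m_1), \ldots, x(m_{P-1})]^T\}$ with mean vector
$\boldsymbol{\mu}_x$ and covariance matrix $\boldsymbol{\Sigma}_{xx}$. The multivariate Gaussian pdf of $\boldsymbol{x}$ is given
by

$$f_X(\boldsymbol{x}) = \frac{1}{(2\pi)^{P/2}\,|\boldsymbol{\Sigma}_{xx}|^{1/2}}\exp\left(-\frac{1}{2}(\boldsymbol{x}-\boldsymbol{\mu}_x)^T\boldsymbol{\Sigma}_{xx}^{-1}(\boldsymbol{x}-\boldsymbol{\mu}_x)\right) \tag{3.83}$$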



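where the mean vector $\boldsymbol{\mu}_x$ is defined as

$$\boldsymbol{\mu}_x = \begin{bmatrix} \mathcal{E}[x(m_0)] \\ \mathcal{E}[x(m_1)] \\ \vdots \\ \mathcal{E}[x(m_{P-1})] \end{bmatrix} \tag{3.84}$$

and the covariance matrix $\boldsymbol{\Sigma}_{xx}$ is given by

$$\boldsymbol{\Sigma}_{xx} = \begin{bmatrix} c_{xx}(m_0,m_0) & c_{xx}(m_0,m_1) & \cdots & c_{xx}(m_0,m_{P-1}) \\ c_{xx}(m_1,m_0) & c_{xx}(m_1,m_1) & \cdots & c_{xx}(m_1,m_{P-1}) \\ \vdots & \vdots & \ddots & \vdots \\ c_{xx}(m_{P-1},m_0) & c_{xx}(m_{P-1},m_1) & \cdots & c_{xx}(m_{P-1},m_{P-1}) \end{bmatrix} \tag{3.85}$$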



The Gaussian process of Equation (3.83) is also denoted by $\mathcal{N}(\boldsymbol{x}, \boldsymbol{\mu}_x, \boldsymbol{\Sigma}_{xx})$. If
the elements of a vector process are uncorrelated then the covariance matrix
is a diagonal matrix with zeros in the off-diagonal elements. In this case the
multivariate pdf may be described as the product of the pdfs of the
individual elements of the vector:

$$f_X\big(\boldsymbol{x}=[x(m_0),\ldots,x(m_{P-1})]^T\big) = \prod_{i=0}^{P-1}\frac{1}{\sqrt{2\pi}\,\sigma_{x_i}}\exp\left\{-\frac{[x(m_i)-\mu_{x_i}]^2}{2\sigma_{x_i}^2}\right\} \tag{3.86}$$



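Example 3.9 Conditional multivariate Gaussian probability density
function. Consider two vector realisations x(m) and y(m+k) from two
vector-valued correlated stationary Gaussian processes $\mathcal{N}(\boldsymbol{x}, \boldsymbol{\mu}_x, \boldsymbol{\Sigma}_{xx})$ and
$\mathcal{N}(\boldsymbol{y}, \boldsymbol{\mu}_y, \boldsymbol{\Sigma}_{yy})$. The joint probability density function of x(m) and y(m+k) is
a multivariate Gaussian density $\mathcal{N}([\boldsymbol{x}(m),\boldsymbol{y}(m+k)], \boldsymbol{\mu}_{(x,y)}, \boldsymbol{\Sigma}_{(x,y)})$, with mean
vector and covariance matrix given by

$$\boldsymbol{\mu}_{(x,y)} = \begin{bmatrix} \boldsymbol{\mu}_x \\ \boldsymbol{\mu}_y \end{bmatrix} \tag{3.87}$$

$$\boldsymbol{\Sigma}_{(x,y)} = \begin{bmatrix} \boldsymbol{\Sigma}_{xx} & \boldsymbol{\Sigma}_{xy} \\ \boldsymbol{\Sigma}_{yx} & \boldsymbol{\Sigma}_{yy} \end{bmatrix} \tag{3.88}$$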
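The conditional density of x(m) given y(m+k) is given from Bayes’ rule as

$$f_{X|Y}\big(\boldsymbol{x}(m)\,|\,\boldsymbol{y}(m+k)\big) = \frac{f_{X,Y}\big(\boldsymbol{x}(m),\boldsymbol{y}(m+k)\big)}{f_Y\big(\boldsymbol{y}(m+k)\big)} \tag{3.89}$$

It can be shown that the conditional density is also a multivariate Gaussian
with its mean vector and covariance matrix given by

$$\boldsymbol{\mu}_{(x|y)} = \mathcal{E}[\boldsymbol{x}(m)\,|\,\boldsymbol{y}(m+k)] = \boldsymbol{\mu}_x + \boldsymbol{\Sigma}_{xy}\boldsymbol{\Sigma}_{yy}^{-1}(\boldsymbol{y}-\boldsymbol{\mu}_y) \tag{3.90}$$

$$\boldsymbol{\Sigma}_{(x|y)} = \boldsymbol{\Sigma}_{xx} - \boldsymbol{\Sigma}_{xy}\boldsymbol{\Sigma}_{yy}^{-1}\boldsymbol{\Sigma}_{yx} \tag{3.91}$$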
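For scalar x and y the matrices in Equations (3.90) and (3.91) reduce to numbers, which makes the result easy to check against Bayes' rule (3.89). The sketch below works through that scalar special case only; the numerical values and function names are assumptions made for illustration.

```python
import math


def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def joint_gaussian_pdf(x, y, mu_x, mu_y, var_x, var_y, cov_xy):
    """Bivariate Gaussian density with a 2-by-2 covariance matrix."""
    det = var_x * var_y - cov_xy ** 2
    dx, dy = x - mu_x, y - mu_y
    quad = (var_y * dx * dx - 2.0 * cov_xy * dx * dy + var_x * dy * dy) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))


def conditional_gaussian(y, mu_x, mu_y, var_x, var_y, cov_xy):
    """Scalar forms of Equations (3.90) and (3.91)."""
    cond_mean = mu_x + cov_xy / var_y * (y - mu_y)
    cond_var = var_x - cov_xy * cov_xy / var_y
    return cond_mean, cond_var


# Assumed example values.
mu_x, mu_y, var_x, var_y, cov_xy = 1.0, 2.0, 4.0, 1.0, 1.5
y_obs = 3.0
cond_mean, cond_var = conditional_gaussian(y_obs, mu_x, mu_y, var_x, var_y, cov_xy)
print("conditional mean:", cond_mean, " conditional variance:", cond_var)

# Cross-check against Bayes' rule (3.89) at an arbitrary point x.
x_test = 0.7
via_bayes = (joint_gaussian_pdf(x_test, y_obs, mu_x, mu_y, var_x, var_y, cov_xy)
             / gaussian_pdf(y_obs, mu_y, var_y))
via_formula = gaussian_pdf(x_test, cond_mean, cond_var)
print("Bayes: %.6f  closed form: %.6f" % (via_bayes, via_formula))
```

Observing y above its mean pulls the conditional mean of x upwards (the covariance is positive here), and the conditional variance is smaller than the prior variance of x, which is the reduction in uncertainty gained from the observation.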



3.5.3 Mixture Gaussian Process 

Probability density functions of many processes, such as speech, are
non-Gaussian. A non-Gaussian pdf may be approximated by a weighted sum (i.e.
a mixture) of a number of Gaussian densities of appropriate mean vectors
and covariance matrices. An M-mixture Gaussian density is defined as

$$f_X(\boldsymbol{x}) = \sum_{i=1}^{M} P_i\,\mathcal{N}_i(\boldsymbol{x}, \boldsymbol{\mu}_{x_i}, \boldsymbol{\Sigma}_{xx_i}) \tag{3.92}$$

where $\mathcal{N}_i(\boldsymbol{x}, \boldsymbol{\mu}_{x_i}, \boldsymbol{\Sigma}_{xx_i})$ is a multivariate Gaussian density with mean vector
$\boldsymbol{\mu}_{x_i}$ and covariance matrix $\boldsymbol{\Sigma}_{xx_i}$, and $P_i$ are the mixing coefficients. The
parameter $P_i$ is the prior probability of the ith mixture component, and is
given by

$$P_i = \frac{N_i}{\sum_{j=1}^{M} N_j} \tag{3.93}$$

where $N_i$ is the number of observations associated with the mixture i. Figure
3.9 shows a non-Gaussian pdf modelled as a mixture of five Gaussian pdfs.
Algorithms developed for Gaussian processes can be extended to mixture
Gaussian densities.

Figure 3.9 A mixture Gaussian pdf.
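A scalar version of the mixture density can be sketched as follows. Everything numerical here is assumed for illustration: the three components, their means and variances, and the observation counts used to form the mixing coefficients as in Equation (3.93).

```python
import math


def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def mixture_pdf(x, weights, means, variances):
    """Scalar form of Equation (3.92): weighted sum of Gaussian densities."""
    return sum(p * gaussian_pdf(x, mu, var)
               for p, mu, var in zip(weights, means, variances))


# Hypothetical observation counts N_i per component; Equation (3.93) turns
# them into mixing coefficients P_i that sum to one.
counts = [50, 30, 20]
weights = [n / sum(counts) for n in counts]
means = [-2.0, 0.5, 3.0]        # assumed component means
variances = [0.5, 1.0, 2.0]     # assumed component variances

# Midpoint-rule check that the mixture is a valid pdf, plus its overall mean.
lo, hi, steps = -15.0, 15.0, 6000
dx = (hi - lo) / steps
grid = [lo + (i + 0.5) * dx for i in range(steps)]
area = sum(mixture_pdf(x, weights, means, variances) for x in grid) * dx
mixture_mean = sum(x * mixture_pdf(x, weights, means, variances) for x in grid) * dx
print("weights:", weights)
print("area under mixture pdf: %.6f" % area)
print("mixture mean: %.6f" % mixture_mean)
```

Because the mixing coefficients sum to one, the mixture integrates to one, and its mean is the weighted average of the component means.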



3.5.4 A Binary-State Gaussian Process 

Consider a random process x(m) with two statistical states, such that in the
state $s_0$ the process has a Gaussian pdf with mean $\mu_{x,0}$ and variance $\sigma_{x,0}^2$,
and in the state $s_1$ the process is also Gaussian with mean $\mu_{x,1}$ and variance
$\sigma_{x,1}^2$ (Figure 3.10). The state-dependent pdf of x(m) can be expressed as

$$f_{X|S}\big(x(m)\,|\,s_i\big) = \frac{1}{\sqrt{2\pi}\,\sigma_{x,i}}\exp\left\{-\frac{1}{2\sigma_{x,i}^2}\big[x(m)-\mu_{x,i}\big]^2\right\}, \quad i=0,1 \tag{3.94}$$

Figure 3.10 Illustration of a binary-state Gaussian process.
The joint probability distribution of the binary-valued state $s_i$ and the
continuous-valued signal x(m) can be expressed as

$$f_{X,S}\big(x(m), s_i\big) = f_{X|S}\big(x(m)\,|\,s_i\big)\,P_S(s_i) = \frac{1}{\sqrt{2\pi}\,\sigma_{x,i}}\exp\left\{-\frac{1}{2\sigma_{x,i}^2}\big[x(m)-\mu_{x,i}\big]^2\right\}P_S(s_i) \tag{3.95}$$

where $P_S(s_i)$ is the state probability. For a multistate process we have the
following probabilistic relations between the joint and marginal
probabilities:

$$\sum_{S} f_{X,S}\big(x(m), s_i\big) = f_X\big(x(m)\big) \tag{3.96}$$

$$\int_{X} f_{X,S}\big(x(m), s_i\big)\,dx = P_S(s_i) \tag{3.97}$$

and

$$\sum_{S}\int_{X} f_{X,S}\big(x(m), s_i\big)\,dx = 1 \tag{3.98}$$

Note that in a multistate model, the statistical parameters of the process
switch between a number of different states, whereas in a single-state
mixture pdf, a weighted combination of a number of pdfs models the
process. In Chapter 5 on hidden Markov models we consider multistate
models with a mixture pdf per state.



3.5.5 Poisson Process 

The Poisson process is a continuous-time, integer-valued counting process, 
used for modelling the occurrence of a random event in various time 
intervals. An important area of application of the Poisson process is in 
queuing theory for the analysis and modelling of the distributions of demand 
on a service facility such as a telephone network, a shared computer system, 
a financial service, a petrol station, etc. Other applications of the Poisson 
distribution include the counting of the number of particles emitted in 
physics, the number of times that a component may fail in a system, and 
modelling of radar clutter, shot noise and impulsive noise. Consider an
event-counting process X(t), in which the probability of occurrence of the
event is governed by a rate function $\lambda(t)$, such that the probability that an
event occurs in a small time interval $\Delta t$ is

$$\mathrm{Prob}\big(1 \text{ occurrence in the interval } (t, t+\Delta t)\big) = \lambda(t)\,\Delta t \tag{3.99}$$

Assuming that in the small interval $\Delta t$, no more than one occurrence of the
event is possible, the probability of no occurrence of the event in a time
interval of $\Delta t$ is given by

$$\mathrm{Prob}\big(0 \text{ occurrences in the interval } (t, t+\Delta t)\big) = 1 - \lambda(t)\,\Delta t \tag{3.100}$$

When the parameter $\lambda(t)$ is independent of time, $\lambda(t)=\lambda$, the process is
called a homogeneous Poisson process. Now, for a homogeneous Poisson
process, consider the probability of k occurrences of an event in a time
interval of $t+\Delta t$, denoted by $P(k,(0,t+\Delta t))$:

$$P\big(k,(0,t+\Delta t)\big) = P\big(k,(0,t)\big)P\big(0,(t,t+\Delta t)\big) + P\big(k-1,(0,t)\big)P\big(1,(t,t+\Delta t)\big) = P\big(k,(0,t)\big)(1-\lambda\Delta t) + P\big(k-1,(0,t)\big)\lambda\Delta t \tag{3.101}$$

Rearranging Equation (3.101), and letting $\Delta t$ tend to zero, we obtain the
following linear differential equation:

$$\frac{dP(k,t)}{dt} = -\lambda P(k,t) + \lambda P(k-1,t) \tag{3.102}$$

where $P(k,t)=P(k,(0,t))$. The solution of this differential equation is given
by

$$P(k,t) = \lambda e^{-\lambda t}\int_{0}^{t} P(k-1,\tau)\,e^{\lambda\tau}\,d\tau \tag{3.103}$$

Equation (3.103) can be solved recursively: starting with $P(0,t)=e^{-\lambda t}$ and
$P(1,t)=\lambda t\,e^{-\lambda t}$, we obtain the Poisson density

$$P(k,t) = \frac{(\lambda t)^k}{k!}\,e^{-\lambda t} \tag{3.104}$$
From Equation (3.104), it is easy to show that for a homogeneous Poisson
process, the probability of k occurrences of an event in a time interval
$(t_1, t_2)$ is given by

$$P[k,(t_1,t_2)] = \frac{[\lambda(t_2-t_1)]^k}{k!}\,e^{-\lambda(t_2-t_1)} \tag{3.105}$$

A Poisson counting process X(t) is incremented by one every time the event
occurs. From Equation (3.104), the mean, the autocorrelation and the variance
of a Poisson counting process X(t) are

$$\mathcal{E}[X(t)] = \lambda t \tag{3.106}$$

$$r_{XX}(t_1,t_2) = \mathcal{E}[X(t_1)X(t_2)] = \lambda^2 t_1 t_2 + \lambda\min(t_1,t_2) \tag{3.107}$$

$$\mathrm{Var}[X(t)] = \mathcal{E}[X^2(t)] - \mathcal{E}^2[X(t)] = \lambda t \tag{3.108}$$

Note that the variance of a Poisson process is equal to its mean value. 
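The Poisson results above can be checked numerically. The sketch below is illustrative only: the rate and interval length are assumed values, the infinite sums are truncated at k = 59, and a central difference stands in for the time derivative when checking the differential equation (3.102).

```python
import math


def poisson_pmf(k, rate, t):
    """Equation (3.104): probability of k events in (0, t) at a constant rate."""
    if k < 0:
        return 0.0
    lam_t = rate * t
    return lam_t ** k / math.factorial(k) * math.exp(-lam_t)


rate, t = 2.0, 1.5   # assumed values, so rate * t = 3 expected events
ks = range(60)       # the tail beyond k = 59 is negligible for rate * t = 3

total = sum(poisson_pmf(k, rate, t) for k in ks)
mean = sum(k * poisson_pmf(k, rate, t) for k in ks)
variance = sum(k * k * poisson_pmf(k, rate, t) for k in ks) - mean ** 2
print("sum of probabilities: %.6f" % total)
print("mean: %.6f  variance: %.6f" % (mean, variance))


def ode_residual(k, rate, t, step=1e-5):
    """Left side minus right side of Equation (3.102), by central difference."""
    derivative = (poisson_pmf(k, rate, t + step)
                  - poisson_pmf(k, rate, t - step)) / (2.0 * step)
    return derivative - (-rate * poisson_pmf(k, rate, t)
                         + rate * poisson_pmf(k - 1, rate, t))


print("largest residual for k = 0..5:",
      max(abs(ode_residual(k, rate, t)) for k in range(6)))
```

The mean and the variance both come out equal to rate times t, and the residuals of the differential equation are at the level of the finite-difference error.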



3.5.6 Shot Noise 

Shot noise happens when there is randomness in a directional flow of 
particles: as in the flow of electrons from the cathode to the anode of a 
cathode ray tube, the flow of photons in a laser beam, the flow and 
recombination of electrons and holes in semiconductors, and the flow of 
photoelectrons emitted in photodiodes. Shot noise has the form of a random 
pulse sequence. The pulse sequence can be modelled as the response of a 
linear filter excited by a Poisson-distributed binary impulse input sequence 
(Figure 3.11). Consider a Poisson-distributed binary-valued impulse process 
x(t). Divide the time axis into uniform short intervals of $\Delta t$ such that only
one occurrence of an impulse is possible within each time interval. Let
$x(m\Delta t)$ be “1” if an impulse is present in the interval $m\Delta t$ to $(m+1)\Delta t$, and
“0” otherwise. For $x(m\Delta t)$, we have

$$\mathcal{E}[x(m\Delta t)] = 1\times P\big(x(m\Delta t)=1\big) + 0\times P\big(x(m\Delta t)=0\big) = \lambda\Delta t \tag{3.109}$$

and

$$\mathcal{E}[x(m\Delta t)\,x(n\Delta t)] = \begin{cases} 1\times P\big(x(m\Delta t)=1\big) = \lambda\Delta t, & m = n \\ 1\times P\big(x(m\Delta t)=1\big)\times P\big(x(n\Delta t)=1\big) = (\lambda\Delta t)^2, & m \ne n \end{cases} \tag{3.110}$$

Figure 3.11 Shot noise is modelled as the output of a filter excited with a process.



A shot noise process $y(t)$ is defined as the output of a linear system with an
impulse response $h(t)$, excited by a Poisson-distributed binary impulse input
$x(t)$:

$$
\begin{aligned}
y(t) &= \int_{-\infty}^{\infty} x(\tau)\,h(t-\tau)\,d\tau \\
&= \sum_{m=-\infty}^{\infty} x(m\Delta t)\,h(t-m\Delta t)
\end{aligned}
\tag{3.111}
$$



where the binary signal $x(m\Delta t)$ can assume a value of 0 or 1. In Equation
(3.111) it is assumed that the impulses happen at the beginning of each
interval. This assumption becomes more valid as $\Delta t$ becomes smaller. The
expectation of $y(t)$ is obtained as

$$
\begin{aligned}
\mathcal{E}[y(t)] &= \sum_{m=-\infty}^{\infty} \mathcal{E}[x(m\Delta t)]\,h(t-m\Delta t) \\
&= \sum_{m=-\infty}^{\infty} \lambda\Delta t\,h(t-m\Delta t)
\end{aligned}
\tag{3.112}
$$

and

$$
\begin{aligned}
r_{yy}(t_1,t_2) &= \mathcal{E}[y(t_1)\,y(t_2)] \\
&= \sum_{m=-\infty}^{\infty}\sum_{n=-\infty}^{\infty} \mathcal{E}[x(m\Delta t)\,x(n\Delta t)]\,h(t_1-m\Delta t)\,h(t_2-n\Delta t)
\end{aligned}
\tag{3.113}
$$

Using Equation (3.110), the autocorrelation of $y(t)$ can be obtained as

$$
r_{yy}(t_1,t_2) = \sum_{m=-\infty}^{\infty} (\lambda\Delta t)\,h(t_1-m\Delta t)\,h(t_2-m\Delta t)
+ \sum_{m=-\infty}^{\infty}\sum_{\substack{n=-\infty \\ n\neq m}}^{\infty} (\lambda\Delta t)^{2}\,h(t_1-m\Delta t)\,h(t_2-n\Delta t)
\tag{3.114}
$$
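
The expectation in Equation (3.112) can be compared with a direct simulation of the binary impulse model. The sketch below is illustrative only: the exponential impulse response, the rate λ = 5, the interval Δt = 0.01 and the number of past intervals are assumptions chosen for the example, not values from the text.

```python
import math
import random

def impulse_response(t, tau=0.5):
    """Hypothetical causal filter: h(t) = exp(-t/tau) for t >= 0, zero otherwise."""
    return math.exp(-t / tau) if t >= 0.0 else 0.0

def expected_shot_noise(lam, dt, n_lags, h=impulse_response):
    """Right-hand side of Equation (3.112), summed over the n_lags most recent intervals."""
    return sum(lam * dt * h(j * dt) for j in range(n_lags))

def shot_noise_sample(lam, dt, n_lags, rng, h=impulse_response):
    """One realisation of y(t): each past interval holds an impulse with probability lam*dt."""
    return sum(h(j * dt) for j in range(n_lags) if rng.random() < lam * dt)

rng = random.Random(0)                       # fixed seed so the run is repeatable
lam, dt, n_lags = 5.0, 0.01, 500
theory = expected_shot_noise(lam, dt, n_lags)
trials = 2000
average = sum(shot_noise_sample(lam, dt, n_lags, rng) for _ in range(trials)) / trials
print(theory, average)                       # both should lie near lam * tau = 2.5
```

For this filter the sum in Equation (3.112) approaches λ times the area under h(t) as Δt becomes small, which is why both printed values are close to λτ.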

3.5.7 Poisson-Gaussian Model for Clutters and Impulsive Noise 

An impulsive noise process consists of a sequence of short-duration pulses
of random amplitude and random time of occurrence, whose shape and
duration depend on the characteristics of the channel through which the
impulse propagates. A Poisson process can be used to model the random
time of occurrence of impulsive noise, and a Gaussian process can be used
to model the random amplitude of the impulses. Finally, the finite duration
character of real impulsive noise may be modelled by the impulse response
of a linear filter. The Poisson-Gaussian impulsive noise model is given by

$$x(m) = \sum_{k=-\infty}^{\infty} A_k\,h(m-\tau_k) \tag{3.115}$$

where $h(m)$ is the response of a linear filter that models the shape of
impulsive noise, $A_k$ is a zero-mean Gaussian process of variance $\sigma^{2}$ and $\tau_k$ is
a Poisson process. The output of a filter excited by a Poisson-distributed
sequence of Gaussian amplitude impulses can also be used to model clutters
in radar. Clutters are due to reflection of radar pulses from a multitude of
background surfaces and objects other than the radar target.
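
A discrete-time realisation of the model in Equation (3.115) might be generated as follows. This is a sketch under stated assumptions: the Poisson arrival times are approximated by an independent Bernoulli trial at each sample (as in the shot noise model of Section 3.5.6), and the decaying pulse shape, the impulse rate and the amplitude standard deviation are hypothetical values.

```python
import math
import random

def impulsive_noise(n_samples, rate, sigma, h, rng):
    """Sketch of Equation (3.115) on a sampled time axis.

    rate  : assumed probability that an impulse starts at any given sample
    sigma : standard deviation of the zero-mean Gaussian amplitudes A_k
    h     : list of filter taps that model the impulse shape
    """
    x = [0.0] * n_samples
    for t_k in range(n_samples):
        if rng.random() < rate:                  # an impulse occurs at time t_k
            a_k = rng.gauss(0.0, sigma)          # random Gaussian amplitude
            for j, tap in enumerate(h):          # add the scaled, shifted pulse shape
                if t_k + j < n_samples:
                    x[t_k + j] += a_k * tap
    return x

rng = random.Random(1)
h = [math.exp(-j / 3.0) for j in range(12)]      # hypothetical decaying pulse shape
noise = impulsive_noise(1000, rate=0.01, sigma=2.0, h=h, rng=rng)
print(sum(1 for v in noise if v != 0.0), "non-zero samples out of", len(noise))
```

Most samples are exactly zero, and the non-zero samples occur in short bursts whose duration is set by the length of `h`.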



3.5.8 Markov Processes 

A first-order discrete-time Markov process is defined as one in which the
state of the process at time $m$ depends only on its state at time $m-1$ and is
independent of the process history before $m-1$. In probabilistic terms, a
first-order Markov process can be defined as

$$
\begin{aligned}
f_X\big(x(m)=x_m \mid x(m-1)=x_{m-1},\ldots,x(m-N)=x_{m-N}\big) \\
= f_X\big(x(m)=x_m \mid x(m-1)=x_{m-1}\big)
\end{aligned}
\tag{3.116}
$$

Figure 3.12 A first order autoregressive (Markov) process.





The marginal density of a Markov process at time $m$ can be obtained by
integrating the conditional density over all values of $x(m-1)$:

$$
f_X\big(x(m)=x_m\big) = \int_{-\infty}^{\infty} f_X\big(x(m)=x_m \mid x(m-1)=x_{m-1}\big)\,f_X\big(x(m-1)=x_{m-1}\big)\,dx_{m-1}
\tag{3.117}
$$

A process in which the present state of the system depends on the past $n$
states may be described in terms of $n$ first-order Markov processes and is
known as an $n$th order Markov process. The term “Markov process” usually
refers to a first order process.

Example 3.10 A simple example of a Markov process is a first-order
autoregressive process (Figure 3.12) defined as

$$x(m) = a\,x(m-1) + e(m) \tag{3.118}$$

In Equation (3.118), $x(m)$ depends on the previous value $x(m-1)$ and the
input $e(m)$. The conditional pdf of $x(m)$ given the previous sample value can
be expressed as

$$
\begin{aligned}
f_X\big(x(m) \mid x(m-1),\ldots,x(m-N)\big) &= f_X\big(x(m) \mid x(m-1)\big) \\
&= f_E\big(e(m)=x(m)-a\,x(m-1)\big)
\end{aligned}
\tag{3.119}
$$



where $f_E(e(m))$ is the pdf of the input signal $e(m)$. Assuming that the input $e(m)$
is a zero-mean Gaussian process with variance $\sigma_e^{2}$, we have

$$
\begin{aligned}
f_X\big(x(m) \mid x(m-1),\ldots,x(m-N)\big) &= f_X\big(x(m) \mid x(m-1)\big) \\
&= f_E\big(x(m)-a\,x(m-1)\big) \\
&= \frac{1}{\sqrt{2\pi}\,\sigma_e}\exp\left(-\frac{1}{2\sigma_e^{2}}\big(x(m)-a\,x(m-1)\big)^{2}\right)
\end{aligned}
\tag{3.120}
$$

Figure 3.13 A Markov chain model of a four-state discrete-time Markov process.

When the input to a Markov model is a Gaussian process the output is 
known as a Gauss-Markov process. 
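
The first-order autoregressive model of Example 3.10 and its conditional pdf can be sketched in a few lines. The coefficient a = 0.9, the unit input variance and the function names are assumptions made for illustration.

```python
import math
import random

def ar1_conditional_pdf(x_m, x_prev, a, sigma_e):
    """Equation (3.120): Gaussian pdf of the input e(m) = x(m) - a*x(m-1)."""
    e = x_m - a * x_prev
    return math.exp(-e * e / (2.0 * sigma_e ** 2)) / (math.sqrt(2.0 * math.pi) * sigma_e)

def simulate_ar1(n, a, sigma_e, rng, x0=0.0):
    """Generate n samples of x(m) = a*x(m-1) + e(m), Equation (3.118)."""
    samples, prev = [], x0
    for _ in range(n):
        prev = a * prev + rng.gauss(0.0, sigma_e)   # zero-mean Gaussian input e(m)
        samples.append(prev)
    return samples

samples = simulate_ar1(5, a=0.9, sigma_e=1.0, rng=random.Random(0))
print(samples)
print(ar1_conditional_pdf(samples[1], samples[0], a=0.9, sigma_e=1.0))
```

The conditional pdf depends on the past only through `x_prev`, which is the Markov property of Equation (3.116) in code form.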

3.5.9 Markov Chain Processes 

A discrete-time Markov process $x(m)$ with $N$ allowable states may be
modelled by a Markov chain of $N$ states (Figure 3.13). Each state can be
associated with one of the $N$ values that $x(m)$ may assume. In a Markov
chain, the Markovian property is modelled by a set of state transition
probabilities defined as

$$a_{ij}(m-1,m) = \mathrm{Prob}\big(x(m)=j \mid x(m-1)=i\big) \tag{3.121}$$

where $a_{ij}(m-1,m)$ is the probability that the process, being in state $i$ at
time $m-1$, moves to state $j$ at time $m$. In Equation (3.121), the
transition probability is expressed in a general time-dependent form. The
marginal probability that a Markov process is in the state $j$ at time $m$, $P_j(m)$,
can be expressed as

$$P_j(m) = \sum_{i=1}^{N} P_i(m-1)\,a_{ij}(m-1,m) \tag{3.122}$$

A Markov chain is defined by the following set of parameters:

number of states $N$
state probability vector

$$\boldsymbol{p}^{\mathrm{T}}(m) = [\,p_1(m),\,p_2(m),\,\ldots,\,p_N(m)\,]$$

and the state transition matrix

$$
A(m-1,m) =
\begin{bmatrix}
a_{11}(m-1,m) & a_{12}(m-1,m) & \cdots & a_{1N}(m-1,m) \\
a_{21}(m-1,m) & a_{22}(m-1,m) & \cdots & a_{2N}(m-1,m) \\
\vdots & \vdots & \ddots & \vdots \\
a_{N1}(m-1,m) & a_{N2}(m-1,m) & \cdots & a_{NN}(m-1,m)
\end{bmatrix}
$$
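
Equation (3.122) amounts to multiplying the row vector of state probabilities by the transition matrix. The sketch below illustrates this for a hypothetical two-state chain whose transition matrix is assumed not to change with time; the numerical values are made up for the example.

```python
def propagate(p_prev, A):
    """Equation (3.122): p_j(m) = sum over i of p_i(m-1) * a_ij."""
    n = len(p_prev)
    return [sum(p_prev[i] * A[i][j] for i in range(n)) for j in range(n)]

# Hypothetical two-state chain; each row of A sums to one.
A = [[0.9, 0.1],
     [0.5, 0.5]]

p = [1.0, 0.0]                  # the process starts in the first state with certainty
for m in range(1, 4):
    p = propagate(p, A)
    print(m, p)
```

Repeated propagation moves the state probabilities towards a fixed vector that is left unchanged by `propagate`; for this particular matrix that vector is [5/6, 1/6].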



Homogeneous and Inhomogeneous Markov Chains

A Markov chain with time-invariant state transition probabilities is known
as a homogeneous Markov chain. For a homogeneous Markov process, the
probability of a transition from state $i$ to state $j$ of the process is independent
of the time of the transition $m$, as expressed in the following equation:

$$\mathrm{Prob}\big(x(m)=j \mid x(m-1)=i\big) = a_{ij}(m-1,m) = a_{ij} \tag{3.123}$$

Inhomogeneous Markov chains have time-dependent transition probabilities.
In most applications of Markov chains, homogeneous models are used
because they usually provide an adequate model of the signal process, and
because homogeneous Markov models are easier to train and use. Markov
models are considered in Chapter 5.

Figure 3.14 Transformation of a random process x(m) to an output process y(m).



3.6 Transformation of a Random Process 

In this section we consider the effect of filtering or transformation of a
random process on its probability density function. Figure 3.14 shows a
generalised mapping operator $h(\cdot)$ that transforms a random input process $X$
into an output process $Y$. The input and output signals $x(m)$ and $y(m)$ are
realisations of the random processes $X$ and $Y$ respectively. If $x(m)$ and $y(m)$
are both discrete-valued such that $x(m)\in\{x_1,\ldots,x_N\}$ and $y(m)\in\{y_1,\ldots,y_M\}$
then we have

$$P_Y\big(y(m)=y_j\big) = \sum_{x_i \to y_j} P_X\big(x(m)=x_i\big) \tag{3.124}$$

where the summation is taken over all values of $x(m)$ that map to $y(m)=y_j$.
Now consider the transformation of a discrete-time, continuous-valued
process. The probability that the output process $Y$ has a value in the range
$y(m)<Y<y(m)+\Delta y$ is

$$\mathrm{Prob}\big[y(m)<Y<y(m)+\Delta y\big] = \int_{x(m)\,\mid\,y(m)<Y<y(m)+\Delta y} f_X\big(x(m)\big)\,dx(m) \tag{3.125}$$

where the integration is taken over all the values of $x(m)$ that yield an output
in the range $y(m)$ to $y(m)+\Delta y$.



3.6.1 Monotonic Transformation of Random Processes 

Now for a monotonic one-to-one transformation $y(m)=h[x(m)]$ (e.g. as in
Figure 3.15) Equation (3.125) becomes

$$\mathrm{Prob}\big(y(m)<Y<y(m)+\Delta y\big) = \mathrm{Prob}\big(x(m)<X<x(m)+\Delta x\big) \tag{3.126}$$

or, in terms of the cumulative distribution functions

$$F_Y\big(y(m)+\Delta y\big) - F_Y\big(y(m)\big) = F_X\big(x(m)+\Delta x\big) - F_X\big(x(m)\big) \tag{3.127}$$

Multiplication of the left-hand side of Equation (3.127) by $\Delta y/\Delta y$ and the
right-hand side by $\Delta x/\Delta x$ and re-arrangement of the terms yields

$$\frac{F_Y\big(y(m)+\Delta y\big) - F_Y\big(y(m)\big)}{\Delta y} = \frac{\Delta x}{\Delta y}\,\frac{F_X\big(x(m)+\Delta x\big) - F_X\big(x(m)\big)}{\Delta x} \tag{3.128}$$

Figure 3.15 An example of a monotonic one-to-one mapping.



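Now as the intervals $\Delta x$ and $\Delta y$ tend to zero, Equation (3.128) becomes

$$f_Y\big(y(m)\big) = \left|\frac{\partial x(m)}{\partial y(m)}\right| f_X\big(x(m)\big) \tag{3.129}$$

where $f_Y(y(m))$ is the probability density function of the output. In Equation (3.129),
substitution of $x(m)=h^{-1}(y(m))$ yields

$$f_Y\big(y(m)\big) = \left|\frac{\partial h^{-1}\big(y(m)\big)}{\partial y(m)}\right| f_X\big(h^{-1}(y(m))\big) \tag{3.130}$$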



Equation (3.130) gives the pdf of the output signal in terms of the pdf of the
input signal and the transformation.

Example 3.11 Transformation of a Gaussian process to a log-normal
process. Log-normal pdfs are used for modelling positive-valued processes
such as power spectra. If a random variable $x(m)$ has a Gaussian pdf as in
Equation (3.80) then the positive-valued variable $y(m)=\exp(x(m))$ has a
log-normal distribution (Figure 3.16) obtained using Equation (3.130) as

$$f_Y(y) = \frac{1}{\sqrt{2\pi}\,\sigma_x\,y(m)}\exp\left\{-\frac{[\ln y(m)-\mu_x]^{2}}{2\sigma_x^{2}}\right\} \tag{3.131}$$

Conversely, if the input $y$ to a logarithmic function has a log-normal
distribution then the output $x=\ln y$ is Gaussian. The mapping functions for
translating the mean and variance of a log-normal distribution to a normal
distribution can be derived as

$$\mu_x = \ln\mu_y - \frac{1}{2}\ln\left(1+\frac{\sigma_y^{2}}{\mu_y^{2}}\right) \tag{3.132}$$

$$\sigma_x^{2} = \ln\left(1+\frac{\sigma_y^{2}}{\mu_y^{2}}\right) \tag{3.133}$$

where $(\mu_x,\sigma_x^{2})$ and $(\mu_y,\sigma_y^{2})$ are the mean and variance of $x$ and $y$ respectively.
The inverse mapping relations for the translation of mean and variances of
normal to log-normal variables are

$$\mu_y = \exp\left(\mu_x+\sigma_x^{2}/2\right) \tag{3.134}$$

$$\sigma_y^{2} = \mu_y^{2}\left[\exp(\sigma_x^{2})-1\right] \tag{3.135}$$
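
The log-normal pdf of Example 3.11 can be obtained in code by applying Equation (3.130) directly, with $h^{-1}(y)=\ln y$ and derivative $1/y$. The sketch below is illustrative: it uses a plain midpoint-rule integration, and the values of the Gaussian mean and standard deviation and the integration limits are assumptions. It checks that the resulting pdf has unit area and that its mean agrees with Equation (3.134).

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Scalar Gaussian pdf with mean mu and standard deviation sigma."""
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma ** 2)) / (math.sqrt(2.0 * math.pi) * sigma)

def lognormal_pdf(y, mu_x, sigma_x):
    """Equation (3.130) with h(x) = exp(x): inverse ln(y), derivative magnitude 1/y."""
    if y <= 0.0:
        return 0.0
    return gaussian_pdf(math.log(y), mu_x, sigma_x) / y

def integrate(f, lo, hi, n=100000):
    """Midpoint rule; adequate for this smooth integrand."""
    step = (hi - lo) / n
    return sum(f(lo + (i + 0.5) * step) for i in range(n)) * step

mu_x, sigma_x = 0.0, 0.5                                   # assumed Gaussian parameters
area = integrate(lambda y: lognormal_pdf(y, mu_x, sigma_x), 0.0, 50.0)
mean = integrate(lambda y: y * lognormal_pdf(y, mu_x, sigma_x), 0.0, 50.0)
print(area, mean, math.exp(mu_x + sigma_x ** 2 / 2.0))     # area near 1; the two means agree
```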
Figure 3.17 Illustration of a many to one transformation.



3.6.2 Many-to-One Mapping of Random Signals 

Now consider the case when the transformation $h(\cdot)$ is a non-monotonic
function such as that shown in Figure 3.17. Assuming that the equation
$y(m)=h[x(m)]$ has $K$ roots, there are $K$ different values of $x(m)$ that map to
the same $y(m)$. The probability that a realisation of the output process $Y$ has
a value in the range $y(m)$ to $y(m)+\Delta y$ is given by

$$\mathrm{Prob}\big(y(m)<Y<y(m)+\Delta y\big) = \sum_{k=1}^{K}\mathrm{Prob}\big(x_k(m)<X<x_k(m)+\Delta x_k\big) \tag{3.136}$$

where $x_k$ is the $k$th root of $y(m)=h(x(m))$. Similar to the development in
Section 3.6.1, Equation (3.136) can be written as

$$\frac{F_Y\big(y(m)+\Delta y\big) - F_Y\big(y(m)\big)}{\Delta y}\,\Delta y = \sum_{k=1}^{K}\frac{F_X\big(x_k(m)+\Delta x_k\big) - F_X\big(x_k(m)\big)}{\Delta x_k}\,\Delta x_k \tag{3.137}$$

Equation (3.137) can be rearranged as

$$\frac{F_Y\big(y(m)+\Delta y\big) - F_Y\big(y(m)\big)}{\Delta y} = \sum_{k=1}^{K}\frac{\Delta x_k}{\Delta y}\,\frac{F_X\big(x_k(m)+\Delta x_k\big) - F_X\big(x_k(m)\big)}{\Delta x_k} \tag{3.138}$$

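Now as the intervals $\Delta x$ and $\Delta y$ tend to zero, Equation (3.138) becomes

$$
\begin{aligned}
f_Y\big(y(m)\big) &= \sum_{k=1}^{K}\left|\frac{\partial x_k(m)}{\partial y(m)}\right| f_X\big(x_k(m)\big) \\
&= \sum_{k=1}^{K}\frac{1}{\left|h'(x_k(m))\right|}\,f_X\big(x_k(m)\big)
\end{aligned}
\tag{3.139}
$$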



where $h'(x_k(m)) = \partial h(x_k(m))/\partial x_k(m)$. Note that for a monotonic function,
$K=1$ and Equation (3.139) becomes the same as Equation (3.130). Equation
(3.139) can be expressed as

$$f_Y\big(y(m)\big) = \sum_{k=1}^{K}\left|J(x_k(m))\right|^{-1} f_X\big(x_k(m)\big) \tag{3.140}$$

where $J(x_k(m)) = h'(x_k(m))$ is called the Jacobian of the transformation.
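
As an illustration of Equation (3.139), consider the square-law mapping $y=x^{2}$, which has the two roots $x=\pm\sqrt{y}$ for every positive $y$. The sketch below sums the contributions of both roots for a Gaussian input; the choice of mapping and the function names are assumptions for illustration. For a zero-mean, unit-variance input the result is the known chi-square density with one degree of freedom, $e^{-y/2}/\sqrt{2\pi y}$.

```python
import math

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Scalar Gaussian pdf with mean mu and standard deviation sigma."""
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma ** 2)) / (math.sqrt(2.0 * math.pi) * sigma)

def square_law_pdf(y, mu=0.0, sigma=1.0):
    """Equation (3.139) for y = h(x) = x**2, with roots +sqrt(y) and -sqrt(y)."""
    if y <= 0.0:
        return 0.0
    root = math.sqrt(y)
    total = 0.0
    for x_k in (root, -root):
        slope = 2.0 * x_k                                   # h'(x_k) for h(x) = x**2
        total += gaussian_pdf(x_k, mu, sigma) / abs(slope)  # |h'(x_k)|^-1 * f_X(x_k)
    return total

for y in (0.1, 1.0, 4.0):
    print(y, square_law_pdf(y), math.exp(-y / 2.0) / math.sqrt(2.0 * math.pi * y))
```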
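For a multivariate transformation of a vector-valued process such as

$$\boldsymbol{y}(m) = H\big(\boldsymbol{x}(m)\big) \tag{3.141}$$

the pdf of the output $\boldsymbol{y}(m)$ is given by

$$f_Y\big(\boldsymbol{y}(m)\big) = \sum_{k=1}^{K}\left|J(\boldsymbol{x}_k(m))\right|^{-1} f_X\big(\boldsymbol{x}_k(m)\big) \tag{3.142}$$

where $|J(\boldsymbol{x})|$, the Jacobian of the transformation $H(\cdot)$, is the determinant of a
matrix of derivatives:

$$
|J(\boldsymbol{x})| =
\begin{vmatrix}
\dfrac{\partial y_1}{\partial x_1} & \dfrac{\partial y_1}{\partial x_2} & \cdots & \dfrac{\partial y_1}{\partial x_P} \\
\vdots & \vdots & \ddots & \vdots \\
\dfrac{\partial y_P}{\partial x_1} & \dfrac{\partial y_P}{\partial x_2} & \cdots & \dfrac{\partial y_P}{\partial x_P}
\end{vmatrix}
\tag{3.143}
$$

For a monotonic linear vector transformation such as

$$\boldsymbol{y} = H\boldsymbol{x} \tag{3.144}$$

the pdf of $\boldsymbol{y}$ becomes

$$f_Y(\boldsymbol{y}) = |J|^{-1} f_X\big(H^{-1}\boldsymbol{y}\big) \tag{3.145}$$

where $|J|$ is the Jacobian of the transformation.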
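Example 3.12 The input-output relation of a $P\times P$ linear transformation
matrix $H$ is given by

$$\boldsymbol{y} = H\boldsymbol{x} \tag{3.146}$$

The Jacobian of the linear transformation $H$ is $|H|$. Assume that the input $\boldsymbol{x}$
is a zero-mean Gaussian $P$-variate process with a covariance matrix of $\Sigma_{xx}$
and a probability density function given by:

$$f_X(\boldsymbol{x}) = \frac{1}{(2\pi)^{P/2}\,|\Sigma_{xx}|^{1/2}}\exp\left(-\frac{1}{2}\boldsymbol{x}^{\mathrm{T}}\Sigma_{xx}^{-1}\boldsymbol{x}\right) \tag{3.147}$$

From Equations (3.145)-(3.147), the pdf of the output $\boldsymbol{y}$ is given by

$$
\begin{aligned}
f_Y(\boldsymbol{y}) &= \frac{1}{(2\pi)^{P/2}\,|\Sigma_{xx}|^{1/2}\,|H|}\exp\left(-\frac{1}{2}\boldsymbol{y}^{\mathrm{T}}H^{-\mathrm{T}}\Sigma_{xx}^{-1}H^{-1}\boldsymbol{y}\right) \\
&= \frac{1}{(2\pi)^{P/2}\,|\Sigma_{xx}|^{1/2}\,|H|}\exp\left(-\frac{1}{2}\boldsymbol{y}^{\mathrm{T}}\Sigma_{yy}^{-1}\boldsymbol{y}\right)
\end{aligned}
\tag{3.148}
$$

where $\Sigma_{yy}=H\Sigma_{xx}H^{\mathrm{T}}$. Note that a linear transformation of a Gaussian
process yields another Gaussian process.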
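
The result of Example 3.12 can be checked numerically for $P=2$. The sketch below evaluates the output pdf in two ways: through the Jacobian form of Equation (3.145), and directly as a zero-mean Gaussian with covariance $H\Sigma_{xx}H^{\mathrm{T}}$. The matrix $H$, the input covariance and the test point are hypothetical, and the small 2-by-2 helper functions are written out only to keep the example self-contained.

```python
import math

def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def inv2(m):
    d = det2(m)
    return [[m[1][1] / d, -m[0][1] / d], [-m[1][0] / d, m[0][0] / d]]

def matmul2(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def transpose2(m):
    return [[m[0][0], m[1][0]], [m[0][1], m[1][1]]]

def matvec2(m, v):
    return [m[0][0] * v[0] + m[0][1] * v[1], m[1][0] * v[0] + m[1][1] * v[1]]

def gaussian_pdf2(v, cov):
    """Zero-mean bivariate Gaussian pdf: Equation (3.147) with P = 2."""
    u = matvec2(inv2(cov), v)
    quad = v[0] * u[0] + v[1] * u[1]
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det2(cov)))

H = [[2.0, 1.0], [0.0, 1.0]]                 # hypothetical invertible transformation
cov_xx = [[1.0, 0.3], [0.3, 2.0]]            # hypothetical input covariance
cov_yy = matmul2(matmul2(H, cov_xx), transpose2(H))

y = [0.7, -0.4]                              # arbitrary test point
via_jacobian = gaussian_pdf2(matvec2(inv2(H), y), cov_xx) / abs(det2(H))   # Equation (3.145)
direct = gaussian_pdf2(y, cov_yy)                                          # Gaussian with cov_yy
print(via_jacobian, direct)
```

The two values agree because $|\Sigma_{yy}|^{1/2}=|H|\,|\Sigma_{xx}|^{1/2}$, so the normalising constants of the two forms are identical.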



3.7 Summary 

The theory of statistical processes is central to the development of signal 
processing algorithms. We began this chapter with basic definitions of 
deterministic signals, random signals and random processes. A random 
process generates random signals, and the collection of all signals that can 
be generated by a random process is the space of the process. Probabilistic 
models and statistical measures, originally developed for random variables, 
were extended to model random signals. Although random signals are 
completely described in terms of probabilistic models, for many 
applications it may be sufficient to characterise a process in terms of a set of 
relatively simple statistics such as the mean, the autocorrelation function, 
the covariance and the power spectrum. Much of the theory and application 
of signal processing is concerned with the identification, extraction, and 
utilisation of structures and patterns in a signal process. The correlation function and
its Fourier transform, the power spectrum, are particularly important because
they can be used to identify the patterns in a stochastic process.

We considered the concepts of stationary, ergodic stationary and non- 
stationary processes. The concept of a stationary process is central to the 
theory of linear time-invariant systems, and furthermore even non-stationary 
processes can be modelled with a chain of stationary subprocesses as 
described in Chapter 5 on hidden Markov models. For signal processing 
applications, a number of useful pdfs, including the Gaussian, the mixture 
Gaussian, the Markov and the Poisson process, were considered. These pdf 
models are extensively employed in the remainder of this book. Signal 
processing normally involves the filtering or transformation of an input 
signal to an output signal. We derived general expressions for the pdf of the 
output of a system in terms of the pdf of the input. We also considered some 
applications of stochastic processes for modelling random noise such as
white noise, clutters, shot noise and impulsive noise.

BAYESIAN ESTIMATION

4.1 Bayesian Estimation Theory: Basic Definitions

4.2 Bayesian Estimation

4.3 The Estimate-Maximise Method

4.4 Cramer-Rao Bound on the Minimum Estimator Variance

4.5 Design of Mixture Gaussian Models

4.6 Bayesian Classification

4.7 Modeling the Space of a Random Process

4.8 Summary

Bayesian estimation is a framework for the formulation of statistical
inference problems. In the prediction or estimation of a random 
process from a related observation signal, the Bayesian philosophy is 
based on combining the evidence contained in the signal with prior 
knowledge of the probability distribution of the process. Bayesian 
methodology includes the classical estimators such as maximum a posteriori 
(MAP), maximum-likelihood (ML), minimum mean square error (MMSE) 
and minimum mean absolute value of error (MAVE) as special cases. The 
hidden Markov model, widely used in statistical signal processing, is an 
example of a Bayesian model. Bayesian inference is based on minimisation 
of the so-called Bayes’ risk function, which includes a posterior model of 
the unknown parameters given the observation and a cost-of-error function. 
This chapter begins with an introduction to the basic concepts of estimation 
theory, and considers the statistical measures that are used to quantify the 
performance of an estimator. We study Bayesian estimation methods and 
consider the effect of using a prior model on the mean and the variance of an 
estimate. The estimate-maximise (EM) method for the estimation of a set of 
unknown parameters from an incomplete observation is studied, and applied 
to the mixture Gaussian modelling of the space of a continuous random 
variable. This chapter concludes with an introduction to the Bayesian 
classification of discrete or finite-state signals, and the K-means clustering
method.

4.1 Bayesian Estimation Theory: Basic Definitions

Estimation theory is concerned with the determination of the best estimate 
of an unknown parameter vector from an observation signal, or the recovery 
of a clean signal degraded by noise and distortion. For example, given a 
noisy sine wave, we may be interested in estimating its basic parameters 
(i.e. amplitude, frequency and phase), or we may wish to recover the signal 
itself. An estimator takes as the input a set of noisy or incomplete 
observations, and, using a dynamic model (e.g. a linear predictive model) 
and/or a probabilistic model (e.g. Gaussian model) of the process, estimates 
the unknown parameters. The estimation accuracy depends on the available 
information and on the efficiency of the estimator. In this chapter, the 
Bayesian estimation of continuous- valued parameters is studied. The 
modelling and classification of finite-state parameters is covered in the next 
chapter. 

Bayesian theory is a general inference framework. In the estimation or
prediction of the state of a process, the Bayesian method employs both the
evidence contained in the observation signal and the accumulated prior
probability of the process. Consider the estimation of the value of a random
parameter vector $\theta$, given a related observation vector $y$. From Bayes’ rule
the posterior probability density function (pdf) of the parameter vector $\theta$
given $y$, $f_{\Theta|Y}(\theta|y)$, can be expressed as

$$f_{\Theta|Y}(\theta|y) = \frac{f_{Y|\Theta}(y|\theta)\,f_{\Theta}(\theta)}{f_Y(y)} \tag{4.1}$$

where for a given observation, $f_Y(y)$ is a constant and has only a normalising
effect. Thus there are two variable terms in Equation (4.1): one term,
$f_{Y|\Theta}(y|\theta)$, is the likelihood that the observation signal $y$ was generated by the
parameter vector $\theta$, and the second term, $f_{\Theta}(\theta)$, is the prior probability of the
parameter vector having a value of $\theta$. The relative influence of the
likelihood pdf $f_{Y|\Theta}(y|\theta)$ and the prior pdf $f_{\Theta}(\theta)$ on the posterior pdf $f_{\Theta|Y}(\theta|y)$
depends on the shape of these functions, i.e. on how relatively peaked each
pdf is. In general the more peaked a probability density function, the more it
will influence the outcome of the estimation process. Conversely, a uniform
pdf will have no influence.
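
The effect of the shape of the prior can be seen by evaluating Equation (4.1) on a grid of parameter values. The following sketch assumes a scalar parameter observed once in unit-variance Gaussian noise; the grid, the observation value and the two priors are illustrative assumptions. With a flat prior the posterior mean stays at the observation, whereas a sharply peaked prior pulls it towards the prior mean.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Scalar Gaussian pdf with mean mu and standard deviation sigma."""
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma ** 2)) / (math.sqrt(2.0 * math.pi) * sigma)

def grid_posterior(thetas, likelihood, prior):
    """Equation (4.1) on a grid: posterior proportional to likelihood times prior."""
    weights = [likelihood(t) * prior(t) for t in thetas]
    norm = sum(weights)                       # plays the role of the constant f_Y(y)
    return [w / norm for w in weights]

def grid_mean(thetas, probs):
    return sum(t * p for t, p in zip(thetas, probs))

thetas = [i * 0.01 for i in range(-500, 501)]             # parameter grid from -5 to 5
y = 1.0                                                   # a single hypothetical observation

def likelihood(t):
    return gaussian_pdf(y, t, 1.0)                        # y = theta + unit-variance noise

def flat_prior(t):
    return 1.0                                            # uniform prior: no influence

def peaked_prior(t):
    return gaussian_pdf(t, 0.0, 0.2)                      # prior concentrated around zero

mean_flat = grid_mean(thetas, grid_posterior(thetas, likelihood, flat_prior))
mean_peaked = grid_mean(thetas, grid_posterior(thetas, likelihood, peaked_prior))
print(mean_flat, mean_peaked)
```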

The remainder of this chapter is concerned with different forms of Bayesian
estimation and its applications. First, in this section, some basic concepts of
estimation theory are introduced.

4.1.1 Dynamic and Probability Models in Estimation

Optimal estimation algorithms utilise dynamic and statistical models of the 
observation signals. A dynamic predictive model captures the correlation 
structure of a signal, and models the dependence of the present and future 
values of the signal on its past trajectory and the input stimulus. A statistical 
probability model characterises the random fluctuations of a signal in terms 
of its statistics, such as the mean and the covariance, and most completely in 
terms of a probability model. Conditional probability models, in addition to 
modelling the random fluctuations of a signal, can also model the 
dependence of the signal on its past values or on some other related process. 

As an illustration consider the estimation of a $P$-dimensional parameter
vector $\theta=[\theta_0,\theta_1,\ldots,\theta_{P-1}]$ from a noisy observation vector $y=[y(0),y(1),\ldots,y(N-1)]$
modelled as

$$y = h(\theta,x,e) + n \tag{4.2}$$

where, as illustrated in Figure 4.1, the function $h(\cdot)$ with a random input $e$,
output $x$, and parameter vector $\theta$, is a predictive model of the signal $x$, and $n$
is an additive random noise process. In Figure 4.1, the distributions of the
random noise $n$, the random input $e$ and the parameter vector $\theta$ are modelled
by probability density functions, $f_N(n)$, $f_E(e)$ and $f_{\Theta}(\theta)$ respectively. The pdf
model most often used is the Gaussian model. Predictive and statistical
models of a process guide the estimator towards the set of values of the
unknown parameters that are most consistent with both the prior distribution
of the model parameters and the noisy observation. In general, the more
modelling information used in an estimation process, the better the results,
provided that the models are an accurate characterisation of the observation
and the parameter process.




Figure 4.1 A random process y is described in terms of a predictive model $h(\cdot)$,
and statistical models $f_E(\cdot)$, $f_{\Theta}(\cdot)$ and $f_N(\cdot)$.

4.1.2 Parameter Space and Signal Space

Consider a random process with a parameter vector $\theta$. For example, each
instance of $\theta$ could be the parameter vector for a dynamic model of a speech
sound or a musical note. The parameter space of a process, $\Theta$, is the
collection of all the values that the parameter vector $\theta$ can assume. The
parameters of a random process determine the “character” (i.e. the mean, the
variance, the power spectrum, etc.) of the signals generated by the process.
As the process parameters change, so do the characteristics of the signals
generated by the process. Each value of the parameter vector $\theta$ of a process
has an associated signal space $Y$; this is the collection of all the signal
realisations of the process with the parameter value $\theta$. For example,
consider a three-dimensional vector-valued Gaussian process with parameter
vector $\theta=[\mu,\Sigma]$, where $\mu$ is the mean vector and $\Sigma$ is the covariance matrix
of the Gaussian process. Figure 4.2 illustrates three mean vectors in a
three-dimensional parameter space. Also shown is the signal space associated
with each parameter. As shown, the signal space of each parameter vector of
a Gaussian process contains an infinite number of points, centred on the
mean vector $\mu$, and with a spatial volume and orientation that are
determined by the covariance matrix $\Sigma$. For simplicity, the variances are not
shown in the parameter space, although they are evident in the shape of the
Gaussian signal clusters in the signal space.



Figure 4.2 Illustration of three points in the parameter space of a Gaussian process
and the associated signal spaces; for simplicity the variances are not shown in the
parameter space. [Figure labels: parameter space, signal space, mapping.]

4.1.3 Parameter Estimation and Signal Restoration

Parameter estimation and signal restoration are closely related problems. 
The main difference is due to the rapid fluctuations of most signals in 
comparison with the relatively slow variations of most parameters. For 
example, speech sounds fluctuate at speeds of up to 20 kHz, whereas the 
underlying vocal tract and pitch parameters vary at a relatively lower rate of 
less than 100 Hz. This observation implies that normally more averaging 
can be done in parameter estimation than in signal restoration. 

As a simple example, consider a signal observed in a zero-mean random 
noise process. Assume we wish to estimate (a) the average of the clean 
signal and (b) the clean signal itself. As the observation length increases, the 
estimate of the signal mean approaches the mean value of the clean signal, 
whereas the estimate of the clean signal samples depends on the correlation 
structure of the signal and the signal-to-noise ratio as well as on the 
estimation method used. 

As a further example, consider the interpolation of a sequence of lost
samples of a signal given $N$ recorded samples, as illustrated in Figure 4.3.
Assume that an autoregressive (AR) process is used to model the signal as

$$y = X\theta + e + n \tag{4.3}$$

where $y$ is the observation signal, $X$ is the signal matrix, $\theta$ is the AR
parameter vector, $e$ is the random input of the AR model and $n$ is the
random noise. Using Equation (4.3), the signal restoration process involves
the estimation of both the model parameter vector $\theta$ and the random input $e$
for the lost samples. Assuming the parameter vector $\theta$ is time-invariant, the
estimate of $\theta$ can be averaged over the entire $N$ observation samples, and as
$N$ becomes infinitely large, a consistent estimate should approach the true
parameter value. The difficulty in signal interpolation is that the underlying
excitation $e$ of the signal $x$ is purely random and, unlike $\theta$, it cannot be
estimated through an averaging operation. In this chapter we are concerned
with the parameter estimation problem, although the same ideas also apply
to signal interpolation, which is considered in Chapter 11.

Figure 4.3 Illustration of signal restoration using a parametric model of the
signal process.

4.1.4 Performance Measures and Desirable Properties of 
Estimators 

In estimation of a parameter vector $\theta$ from $N$ observation samples $y$, a set of
performance measures is used to quantify and compare the characteristics of
different estimators. In general an estimate of a parameter vector is a
function of the observation vector $y$, the length of the observation $N$ and the
process model $\mathcal{M}$. This dependence may be expressed as

$$\hat{\theta} = f(y,N,\mathcal{M}) \tag{4.4}$$

Different parameter estimators produce different results depending on the
estimation method, the utilisation of the observation and the influence of the
prior information. Due to randomness of the observations, even the same
estimator would produce different results with different observations from
the same process. Therefore an estimate is itself a random variable: it has a
mean and a variance, and it may be described by a probability density
function. However, for most cases, it is sufficient to characterise an
estimator in terms of the mean and the variance of the estimation error. The
most commonly used performance measures for an estimator are the
following:

(a) Expected value of estimate: $\mathcal{E}[\hat{\theta}]$

(b) Bias of estimate: $\mathcal{E}[\hat{\theta}-\theta] = \mathcal{E}[\hat{\theta}]-\theta$

(c) Covariance of estimate: $\mathrm{Cov}[\hat{\theta}] = \mathcal{E}\big[(\hat{\theta}-\mathcal{E}[\hat{\theta}])(\hat{\theta}-\mathcal{E}[\hat{\theta}])^{\mathrm{T}}\big]$

Optimal estimators aim for zero bias and minimum estimation error
covariance. The desirable properties of an estimator can be listed as follows:

(a) Unbiased estimator: an estimator of $\theta$ is unbiased if the expectation
of the estimate is equal to the true parameter value:

$$\mathcal{E}[\hat{\theta}] = \theta \tag{4.5}$$

An estimator is asymptotically unbiased if for increasing length of
observations $N$ we have

$$\lim_{N\to\infty}\mathcal{E}[\hat{\theta}] = \theta \tag{4.6}$$

(b) Efficient estimator: an unbiased estimator of $\theta$ is an efficient
estimator if it has the smallest covariance matrix compared with all
other unbiased estimates of $\theta$:

$$\mathrm{Cov}[\hat{\theta}_{\mathrm{Efficient}}] \le \mathrm{Cov}[\hat{\theta}] \tag{4.7}$$

where $\hat{\theta}$ is any other estimate of $\theta$.

(c) Consistent estimator: an estimator is consistent if the estimate
improves with the increasing length of the observation $N$, such that
the estimate $\hat{\theta}$ converges probabilistically to the true value $\theta$ as $N$
becomes infinitely large:

$$\lim_{N\to\infty}P\big[\,|\hat{\theta}-\theta|>\varepsilon\,\big] = 0 \tag{4.8}$$

where $\varepsilon$ is arbitrarily small.

Example 4.1 Consider the bias in the time-averaged estimates of the mean
$\mu_y$ and the variance $\sigma_y^{2}$ of $N$ observation samples $[y(0),\ldots,y(N-1)]$ of an
ergodic random process, given as

$$\hat{\mu}_y = \frac{1}{N}\sum_{m=0}^{N-1} y(m) \tag{4.9}$$

$$\hat{\sigma}_y^{2} = \frac{1}{N}\sum_{m=0}^{N-1}\big[y(m)-\hat{\mu}_y\big]^{2} \tag{4.10}$$

It is easy to show that $\hat{\mu}_y$ is an unbiased estimate, since

$$\mathcal{E}[\hat{\mu}_y] = \frac{1}{N}\sum_{m=0}^{N-1}\mathcal{E}[y(m)] = \mu_y \tag{4.11}$$

Figure 4.4 Illustration of the decrease in the bias and variance of an asymptotically
unbiased estimate of the parameter $\theta$ with increasing length of observation.



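For uncorrelated samples, the expectation of the estimate of the variance can be expressed as

$$\mathcal{E}[\hat{\sigma}_y^{2}] = \mathcal{E}\left[\frac{1}{N}\sum_{m=0}^{N-1}\big(y(m)-\hat{\mu}_y\big)^{2}\right] = \sigma_y^{2} - \frac{1}{N}\sigma_y^{2} \tag{4.12}$$
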

From Equation (4.12), the bias in the estimate of the variance is inversely 
proportional to the signal length N, and vanishes as N tends to infinity; 
hence the estimate is asymptotically unbiased. In general, the bias and the 
variance of an estimate decrease with increasing number of observation 
samples N and with improved modelling. Figure 4.4 illustrates the general 
dependence of the distribution and the bias and the variance of an 
asymptotically unbiased estimator on the number of observation samples N. 
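
The dependence of the bias on $N$ can be observed in a small simulation. The sketch below averages the variance estimate of Equation (4.10) over many independent records of unit-variance Gaussian samples. It relies on the standard result, stated here as an assumption since it holds for uncorrelated samples, that the expected value of the estimate is $(N-1)/N$ times the true variance. The record lengths, the number of trials and the seed are arbitrary choices.

```python
import random

def biased_variance(samples):
    """Equation (4.10): average squared deviation from the sample mean, scaled by 1/N."""
    n = len(samples)
    mean = sum(samples) / n                       # Equation (4.9)
    return sum((s - mean) ** 2 for s in samples) / n

def average_estimate(n, trials, rng):
    """Average of the variance estimate over many independent unit-variance records."""
    total = 0.0
    for _ in range(trials):
        total += biased_variance([rng.gauss(0.0, 1.0) for _ in range(n)])
    return total / trials

rng = random.Random(0)                            # fixed seed so the run is repeatable
for n in (2, 5, 50):
    # The averaged estimate should lie close to (n - 1) / n, approaching 1 as n grows.
    print(n, average_estimate(n, 5000, rng), (n - 1) / n)
```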

4.1.5 Prior and Posterior Spaces and Distributions 

The prior space of a signal or a parameter vector is the collection of all
possible values that the signal or the parameter vector can assume. The
posterior signal or parameter space is the subspace of all the likely values
of a signal or a parameter consistent with both the prior information and the
evidence in the observation. Consider a random process with a parameter
space $\Theta$, observation space $Y$ and a joint pdf $f_{Y,\Theta}(y,\theta)$. From Bayes’ rule
the posterior pdf of the parameter vector $\theta$, given an observation vector $y$,
$f_{\Theta|Y}(\theta|y)$, can be expressed as

$$
\begin{aligned}
f_{\Theta|Y}(\theta|y) &= \frac{f_{Y|\Theta}(y|\theta)\,f_{\Theta}(\theta)}{f_Y(y)} \\
&= \frac{f_{Y|\Theta}(y|\theta)\,f_{\Theta}(\theta)}{\displaystyle\int_{\Theta} f_{Y|\Theta}(y|\theta)\,f_{\Theta}(\theta)\,d\theta}
\end{aligned}
\tag{4.13}
$$

where, for a given observation vector $y$, the pdf $f_Y(y)$ is a constant and has
only a normalising effect. From Equation (4.13), the posterior pdf is
proportional to the product of the likelihood $f_{Y|\Theta}(y|\theta)$ that the observation $y$
was generated by the parameter vector $\theta$, and the prior pdf $f_{\Theta}(\theta)$. The prior
pdf gives the unconditional parameter distribution averaged over the entire
observation space as

$$f_{\Theta}(\theta) = \int_{Y} f_{Y,\Theta}(y,\theta)\,dy \tag{4.14}$$

Figure 4.5 Illustration of joint distribution of signal $y$ and parameter $\theta$ and the
posterior distribution of $\theta$ given $y$.



For most applications, it is relatively convenient to obtain the likelihood
function $f_{Y|\Theta}(y|\theta)$. The prior pdf influences the inference drawn from the
likelihood function by weighting it with $f_{\Theta}(\theta)$. The influence of the prior
is particularly important for short-length and/or noisy observations, where
the confidence in the estimate is limited by the lack of a sufficiently long
observation and by the noise. The influence of the prior on the bias and the
variance of an estimate is considered in Section 4.4.1.

A prior knowledge of the signal distribution can be used to confine the
estimate to the prior signal space. The observation then guides the estimator
to focus on the posterior space: that is the subspace consistent with both the
prior and the observation. Figure 4.5 illustrates the joint pdf of a signal $y(m)$
and a parameter $\theta$. The prior pdf of $\theta$ can be obtained by integrating
$f_{Y,\Theta}(y(m),\theta)$ with respect to $y(m)$. As shown, an observation $y(m)$ cuts a
posterior pdf $f_{\Theta|Y}(\theta|y(m))$ through the joint distribution.

Example 4.2 A noisy signal vector of length $N$ samples is modelled as

$$
y(m) = x(m) + n(m) \tag{4.15}
$$

Assume that the signal $x(m)$ is Gaussian with mean vector $\mu_x$ and covariance matrix $\Sigma_{xx}$, and that the noise $n(m)$ is also Gaussian with mean vector $\mu_n$ and covariance matrix $\Sigma_{nn}$. The signal and noise pdfs model the prior spaces of the signal and the noise respectively. Given an observation vector $y(m)$, the underlying signal $x(m)$ would have a likelihood distribution with a mean vector of $y(m) - \mu_n$ and covariance matrix $\Sigma_{nn}$, as shown in Figure 4.6. The likelihood function is given by



$$
f_{Y|X}(y(m)|x(m)) = f_N(y(m) - x(m)) = \frac{1}{(2\pi)^{N/2} |\Sigma_{nn}|^{1/2}} \exp\left\{ -\frac{1}{2} [x(m) - (y(m) - \mu_n)]^T \Sigma_{nn}^{-1} [x(m) - (y(m) - \mu_n)] \right\} \tag{4.16}
$$



where the terms in the exponential function have been rearranged to 
emphasize the illustration of the likelihood space in Figure 4.6. Hence the 
posterior pdf can be expressed as 
$$
\begin{aligned}
f_{X|Y}(x(m)|y(m)) &= \frac{f_{Y|X}(y(m)|x(m))\, f_X(x(m))}{f_Y(y(m))} \\
&= \frac{1}{f_Y(y(m))} \frac{1}{(2\pi)^{N} |\Sigma_{nn}|^{1/2} |\Sigma_{xx}|^{1/2}} \\
&\quad \times \exp\left( -\frac{1}{2} \left\{ [x(m) - (y(m) - \mu_n)]^T \Sigma_{nn}^{-1} [x(m) - (y(m) - \mu_n)] + (x(m) - \mu_x)^T \Sigma_{xx}^{-1} (x(m) - \mu_x) \right\} \right)
\end{aligned} \tag{4.17}
$$

Figure 4.6 Sketch of two-dimensional signal and noise spaces, and the likelihood and posterior spaces of a noisy observation $y$.



For a two-dimensional signal and noise process, the prior spaces of the 
signal, the noise, and the noisy signal are illustrated in Figure 4.6. Also 
illustrated are the likelihood and posterior spaces for a noisy observation 
vector y. Note that the centre of the posterior space is obtained by 
subtracting the noise mean vector from the noisy signal vector. The clean 
signal is then somewhere within a subspace determined by the noise 
variance. 
4.2 Bayesian Estimation

The Bayesian estimation of a parameter vector $\theta$ is based on the minimisation of a Bayesian risk function defined as an average cost-of-error function:

$$
\begin{aligned}
\mathcal{R}(\hat{\theta}) &= \mathcal{E}[C(\hat{\theta},\theta)] \\
&= \int_{\Theta} \int_{Y} C(\hat{\theta},\theta)\, f_{Y,\Theta}(y,\theta)\, dy\, d\theta \\
&= \int_{\Theta} \int_{Y} C(\hat{\theta},\theta)\, f_{\Theta|Y}(\theta|y)\, f_Y(y)\, dy\, d\theta
\end{aligned} \tag{4.18}
$$

where the cost-of-error function $C(\hat{\theta},\theta)$ allows the appropriate weighting of the various outcomes to achieve desirable objective or subjective properties. The cost function can be chosen to associate a high cost with outcomes that are undesirable or disastrous. For a given observation vector $y$, $f_Y(y)$ is a constant and has no effect on the risk-minimisation process. Hence Equation (4.18) may be written as a conditional risk function:

$$
\mathcal{R}(\hat{\theta}|y) = \int_{\Theta} C(\hat{\theta},\theta)\, f_{\Theta|Y}(\theta|y)\, d\theta \tag{4.19}
$$

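The Bayesian estimate obtained as the minimum-risk parameter vector is given by

$$
\hat{\theta}_{Bayesian} = \arg\min_{\hat{\theta}} \mathcal{R}(\hat{\theta}|y) = \arg\min_{\hat{\theta}} \left[ \int_{\Theta} C(\hat{\theta},\theta)\, f_{\Theta|Y}(\theta|y)\, d\theta \right] \tag{4.20}
$$

Using Bayes' rule, Equation (4.20) can be written as

$$
\hat{\theta}_{Bayesian} = \arg\min_{\hat{\theta}} \left[ \int_{\Theta} C(\hat{\theta},\theta)\, f_{Y|\Theta}(y|\theta)\, f_{\Theta}(\theta)\, d\theta \right] \tag{4.21}
$$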



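Assuming that the risk function is differentiable, and has a well-defined minimum, the Bayesian estimate can be obtained as

$$
\hat{\theta}_{Bayesian} = \arg\operatorname{zero}_{\hat{\theta}} \frac{\partial \mathcal{R}(\hat{\theta}|y)}{\partial \hat{\theta}} = \arg\operatorname{zero}_{\hat{\theta}} \left[ \frac{\partial}{\partial \hat{\theta}} \int_{\Theta} C(\hat{\theta},\theta)\, f_{Y|\Theta}(y|\theta)\, f_{\Theta}(\theta)\, d\theta \right] \tag{4.22}
$$

Figure 4.7 Illustration of the Bayesian cost function for the MAP estimate.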



4.2.1 Maximum A Posteriori Estimation 

The maximum a posteriori (MAP) estimate $\hat{\theta}_{MAP}$ is obtained as the parameter vector that maximises the posterior pdf $f_{\Theta|Y}(\theta|y)$. The MAP estimate corresponds to a Bayesian estimate with a so-called uniform cost function (in fact, as shown in Figure 4.7, the cost function is notch-shaped) defined as

$$
C(\hat{\theta},\theta) = 1 - \delta(\hat{\theta},\theta) \tag{4.23}
$$

where $\delta(\hat{\theta},\theta)$ is the Kronecker delta function. Substitution of the cost function in the Bayesian risk equation yields

$$
\mathcal{R}_{MAP}(\hat{\theta}|y) = \int_{\Theta} [1 - \delta(\hat{\theta},\theta)]\, f_{\Theta|Y}(\theta|y)\, d\theta = 1 - f_{\Theta|Y}(\hat{\theta}|y) \tag{4.24}
$$



From Equation (4.24), the minimum Bayesian risk estimate corresponds to the parameter value where the posterior function attains a maximum. Hence the MAP estimate of the parameter vector $\theta$ is obtained from a minimisation of the risk Equation (4.24) or, equivalently, maximisation of the posterior function:

$$
\hat{\theta}_{MAP} = \arg\max_{\theta} f_{\Theta|Y}(\theta|y) = \arg\max_{\theta} \left[ f_{Y|\Theta}(y|\theta)\, f_{\Theta}(\theta) \right] \tag{4.25}
$$
4.2.2 Maximum-Likelihood Estimation

The maximum-likelihood (ML) estimate $\hat{\theta}_{ML}$ is obtained as the parameter vector that maximises the likelihood function $f_{Y|\Theta}(y|\theta)$. The ML estimator corresponds to a Bayesian estimator with a uniform cost function and a uniform parameter prior pdf:

$$
\mathcal{R}_{ML}(\hat{\theta}|y) = \int_{\Theta} [1 - \delta(\hat{\theta},\theta)]\, f_{Y|\Theta}(y|\theta)\, f_{\Theta}(\theta)\, d\theta = \text{const.}\, [1 - f_{Y|\Theta}(y|\hat{\theta})] \tag{4.26}
$$



where the prior function $f_{\Theta}(\theta) = \text{const}$. From a Bayesian point of view the main difference between the ML and MAP estimators is that the ML assumes that the prior pdf of $\theta$ is uniform. Note that a uniform prior, in addition to modelling genuinely uniform pdfs, is also used when the parameter prior pdf is unknown, or when the parameter is an unknown constant.

From Equation (4.26), it is evident that minimisation of the risk function is achieved by maximisation of the likelihood function:

$$
\hat{\theta}_{ML} = \arg\max_{\theta} f_{Y|\Theta}(y|\theta) \tag{4.27}
$$

In practice it is convenient to maximise the log-likelihood function instead of the likelihood:

$$
\hat{\theta}_{ML} = \arg\max_{\theta} \ln f_{Y|\Theta}(y|\theta) \tag{4.28}
$$

The log-likelihood is usually chosen in practice because: 

(a) the logarithm is a monotonic function, and hence the log-likelihood 
has the same turning points as the likelihood function; 

(b) the joint log-likelihood of a set of independent variables is the sum 
of the log-likelihood of individual elements; and 

(c) unlike the likelihood function, the log-likelihood has a dynamic 
range that does not cause computational under-flow. 
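Points (b) and (c) can be seen in a small numerical experiment. The sketch below is only an illustration under assumed conditions (2000 independent unit-variance Gaussian samples, a fixed random seed): the direct product of the sample likelihoods underflows in double precision, whereas the sum of the log-likelihoods remains finite.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=0.0, scale=1.0, size=2000)   # hypothetical independent samples

def gaussian_pdf(x, mean, var):
    """Gaussian pdf with the given mean and variance."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

# Joint likelihood as a product of 2000 values, each below 0.4: underflows.
likelihood = np.prod(gaussian_pdf(y, 0.0, 1.0))

# Joint log-likelihood as a sum of individual log-likelihoods: stays finite.
log_likelihood = np.sum(np.log(gaussian_pdf(y, 0.0, 1.0)))
```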

Example 4.3 ML Estimation of the mean and variance of a Gaussian process. Consider the problem of maximum likelihood estimation of the mean vector $\mu_y$ and the covariance matrix $\Sigma_{yy}$ of a $P$-dimensional Gaussian vector process from $N$ observation vectors $[y(0), y(1), \ldots, y(N-1)]$. Assuming the observation vectors are uncorrelated, the pdf of the observation sequence is given by

$$
f_Y(y(0),\ldots,y(N-1)) = \prod_{m=0}^{N-1} \frac{1}{(2\pi)^{P/2} |\Sigma_{yy}|^{1/2}} \exp\left\{ -\frac{1}{2} [y(m) - \mu_y]^T \Sigma_{yy}^{-1} [y(m) - \mu_y] \right\} \tag{4.29}
$$



and the log-likelihood equation is given by

$$
\ln f_Y(y(0),\ldots,y(N-1)) = \sum_{m=0}^{N-1} \left( -\frac{P}{2} \ln(2\pi) - \frac{1}{2} \ln |\Sigma_{yy}| - \frac{1}{2} [y(m) - \mu_y]^T \Sigma_{yy}^{-1} [y(m) - \mu_y] \right) \tag{4.30}
$$



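Taking the derivative of the log-likelihood equation with respect to the mean vector $\mu_y$ and setting it to zero yields

$$
\frac{\partial \ln f_Y(y(0),\ldots,y(N-1))}{\partial \mu_y} = \sum_{m=0}^{N-1} \Sigma_{yy}^{-1} [y(m) - \mu_y] = 0 \tag{4.31}
$$

From Equation (4.31), we have

$$
\hat{\mu}_y = \frac{1}{N} \sum_{m=0}^{N-1} y(m) \tag{4.32}
$$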



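To obtain the ML estimate of the covariance matrix we take the derivative of the log-likelihood equation with respect to $\Sigma_{yy}^{-1}$:

$$
\frac{\partial \ln f_Y(y(0),\ldots,y(N-1))}{\partial \Sigma_{yy}^{-1}} = \sum_{m=0}^{N-1} \left( \frac{1}{2} \Sigma_{yy} - \frac{1}{2} [y(m) - \mu_y][y(m) - \mu_y]^T \right) = 0 \tag{4.33}
$$

From Equation (4.33), we have an estimate of the covariance matrix as

$$
\hat{\Sigma}_{yy} = \frac{1}{N} \sum_{m=0}^{N-1} [y(m) - \hat{\mu}_y][y(m) - \hat{\mu}_y]^T \tag{4.34}
$$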
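The estimates of Equations (4.32) and (4.34) map directly onto array operations. The following is a minimal sketch on synthetic data; the true mean, true covariance, sample size and seed are assumptions made only for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
true_mean = np.array([1.0, -2.0])                 # assumed for the illustration
true_cov = np.array([[2.0, 0.6], [0.6, 1.0]])     # assumed for the illustration
Y = rng.multivariate_normal(true_mean, true_cov, size=5000)  # row m holds y(m)

N = Y.shape[0]
mean_ml = Y.sum(axis=0) / N          # Equation (4.32)
centred = Y - mean_ml
cov_ml = centred.T @ centred / N     # Equation (4.34): note the division by N
```

Note that the ML covariance estimate divides by N rather than N-1, which corresponds to `np.cov(Y, rowvar=False, bias=True)`.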
Example 4.4 ML and MAP Estimation of a Gaussian Random Parameter. Consider the estimation of a $P$-dimensional random parameter vector $\theta$ from an $N$-dimensional observation vector $y$. Assume that the relation between the signal vector $y$ and the parameter vector $\theta$ is described by a linear model as

$$
y = G\theta + e \tag{4.35}
$$

where $e$ is a random excitation input signal. The pdf of the parameter vector $\theta$ given an observation vector $y$ can be described, using Bayes' rule, as

$$
f_{\Theta|Y}(\theta|y) = \frac{1}{f_Y(y)} f_{Y|\Theta}(y|\theta)\, f_{\Theta}(\theta) \tag{4.36}
$$



Assuming that the matrix $G$ in Equation (4.35) is known, the likelihood of the signal $y$ given the parameter vector $\theta$ is the pdf of the random vector $e$:

$$
f_{Y|\Theta}(y|\theta) = f_E(e = y - G\theta) \tag{4.37}
$$

Now assume the input $e$ is a zero-mean, Gaussian-distributed, random process with a diagonal covariance matrix $\sigma_e^2 I$, and the parameter vector $\theta$ is also a Gaussian process with mean $\mu_\theta$ and covariance matrix $\Sigma_{\theta\theta}$. Therefore we have

$$
f_{Y|\Theta}(y|\theta) = f_E(e) = \frac{1}{(2\pi\sigma_e^2)^{N/2}} \exp\left( -\frac{1}{2\sigma_e^2} (y - G\theta)^T (y - G\theta) \right) \tag{4.38}
$$



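and

$$
f_{\Theta}(\theta) = \frac{1}{(2\pi)^{P/2} |\Sigma_{\theta\theta}|^{1/2}} \exp\left( -\frac{1}{2} (\theta - \mu_\theta)^T \Sigma_{\theta\theta}^{-1} (\theta - \mu_\theta) \right) \tag{4.39}
$$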



The ML estimate obtained from maximisation of the log-likelihood function $\ln[f_{Y|\Theta}(y|\theta)]$ with respect to $\theta$ is given by

$$
\hat{\theta}_{ML}(y) = (G^T G)^{-1} G^T y \tag{4.40}
$$

To obtain the MAP estimate we first form the posterior distribution by substituting Equations (4.38) and (4.39) in Equation (4.36):

$$
f_{\Theta|Y}(\theta|y) = \frac{1}{f_Y(y)} \frac{1}{(2\pi\sigma_e^2)^{N/2}} \frac{1}{(2\pi)^{P/2} |\Sigma_{\theta\theta}|^{1/2}} \exp\left( -\frac{1}{2\sigma_e^2} (y - G\theta)^T (y - G\theta) - \frac{1}{2} (\theta - \mu_\theta)^T \Sigma_{\theta\theta}^{-1} (\theta - \mu_\theta) \right) \tag{4.41}
$$

The MAP parameter estimate is obtained by differentiating the log-posterior function $\ln f_{\Theta|Y}(\theta|y)$ with respect to $\theta$ and setting the derivative to zero:

$$
\hat{\theta}_{MAP}(y) = \left( G^T G + \sigma_e^2 \Sigma_{\theta\theta}^{-1} \right)^{-1} \left( G^T y + \sigma_e^2 \Sigma_{\theta\theta}^{-1} \mu_\theta \right) \tag{4.42}
$$

Note that as the covariance of the Gaussian-distributed parameter increases, or equivalently as $\Sigma_{\theta\theta}^{-1} \to 0$, the Gaussian prior tends to a uniform prior and the MAP solution of Equation (4.42) tends to the ML solution given by Equation (4.40). Conversely, as the pdf of the parameter vector $\theta$ becomes peaked, i.e. as $\Sigma_{\theta\theta} \to 0$, the estimate tends towards $\mu_\theta$.
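The two limiting cases can be checked numerically. The sketch below implements Equations (4.40) and (4.42) for a small synthetic problem; the dimensions, the matrix G, the prior mean, the noise variance and the function names `ml_estimate` and `map_estimate` are assumptions introduced for the illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
N, P = 20, 3
G = rng.normal(size=(N, P))              # known model matrix (synthetic)
mu_theta = np.array([1.0, 0.0, -1.0])    # assumed prior mean
sigma_e2 = 0.5                           # assumed excitation variance

theta_true = np.array([1.2, 0.3, -0.8])
y = G @ theta_true + rng.normal(scale=np.sqrt(sigma_e2), size=N)

def ml_estimate(G, y):
    """Equation (4.40): (G^T G)^{-1} G^T y, solved without an explicit inverse."""
    return np.linalg.solve(G.T @ G, G.T @ y)

def map_estimate(G, y, sigma_e2, mu_theta, cov_theta):
    """Equation (4.42) for a Gaussian prior with mean mu_theta, covariance cov_theta."""
    prec = np.linalg.inv(cov_theta)      # inverse prior covariance
    lhs = G.T @ G + sigma_e2 * prec
    rhs = G.T @ y + sigma_e2 * prec @ mu_theta
    return np.linalg.solve(lhs, rhs)

theta_ml = ml_estimate(G, y)
theta_map_wide = map_estimate(G, y, sigma_e2, mu_theta, 1e8 * np.eye(P))    # near-uniform prior
theta_map_tight = map_estimate(G, y, sigma_e2, mu_theta, 1e-8 * np.eye(P))  # sharply peaked prior
```

With the very wide prior the MAP estimate is numerically indistinguishable from the ML estimate, and with the very narrow prior it collapses onto the prior mean.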



4.2.3 Minimum Mean Square Error Estimation 

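The Bayesian minimum mean square error (MMSE) estimate is obtained as the parameter vector that minimises a mean square error cost function (Figure 4.8) defined as

$$
\mathcal{R}_{MMSE}(\hat{\theta}|y) = \mathcal{E}[(\hat{\theta} - \theta)^2 \,|\, y] = \int_{\Theta} (\hat{\theta} - \theta)^2 f_{\Theta|Y}(\theta|y)\, d\theta \tag{4.43}
$$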

e 



In the following, it is shown that the Bayesian MMSE estimate is the conditional mean of the posterior pdf. Assuming that the mean square error risk function is differentiable and has a well-defined minimum, the MMSE solution can be obtained by setting the gradient of the mean square error risk function to zero:

$$
\frac{\partial \mathcal{R}_{MMSE}(\hat{\theta}|y)}{\partial \hat{\theta}} = 2\hat{\theta} \int_{\Theta} f_{\Theta|Y}(\theta|y)\, d\theta - 2 \int_{\Theta} \theta f_{\Theta|Y}(\theta|y)\, d\theta \tag{4.44}
$$

Since the first integral on the right-hand side of Equation (4.44) is equal to 1, we have

$$
\frac{\partial \mathcal{R}_{MMSE}(\hat{\theta}|y)}{\partial \hat{\theta}} = 2\hat{\theta} - 2 \int_{\Theta} \theta f_{\Theta|Y}(\theta|y)\, d\theta \tag{4.45}
$$

Figure 4.8 Illustration of the mean square error cost function and estimate.



The MMSE solution is obtained by setting Equation (4.45) to zero:

$$
\hat{\theta}_{MMSE}(y) = \int_{\Theta} \theta f_{\Theta|Y}(\theta|y)\, d\theta \tag{4.46}
$$

For cases where we do not have a pdf model of the parameter process, the minimum mean square error (known as the least square error, LSE) estimate is obtained through minimisation of a mean square error function $\mathcal{E}[e^2(\theta|y)]$:

$$
\hat{\theta}_{LSE} = \arg\min_{\theta} \mathcal{E}[e^2(\theta|y)] \tag{4.47}
$$

The LSE estimation of Equation (4.47) does not use any prior knowledge of the distribution of the signals and the parameters. This can be considered as a strength of LSE in situations where the prior pdfs are unknown, but it can also be considered as a weakness in cases where fairly accurate models of the priors are available but not utilised.
Example 4.5 Consider the MMSE estimation of a parameter vector $\theta$ assuming a linear model of the observation $y$ as

$$
y = G\theta + e \tag{4.48}
$$

The LSE estimate is obtained as the parameter vector at which the gradient of the mean squared error with respect to $\theta$ is zero:

$$
\left. \frac{\partial\, e^T e}{\partial \theta} \right|_{\theta_{LSE}} = \left. \frac{\partial}{\partial \theta} \left( y^T y - 2\theta^T G^T y + \theta^T G^T G \theta \right) \right|_{\theta_{LSE}} = 0 \tag{4.49}
$$

From Equation (4.49) the LSE parameter estimate is given by

$$
\hat{\theta}_{LSE} = [G^T G]^{-1} G^T y \tag{4.50}
$$

Note that for a Gaussian likelihood function, the LSE solution is the same as 
the ML solution of Equation (4.40). 
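As a quick check of Equations (4.49) and (4.50), the sketch below solves the normal equations for a synthetic linear model and compares the result with NumPy's general least-squares routine. The data, dimensions and noise level are assumptions for the illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
G = rng.normal(size=(50, 4))                     # synthetic model matrix
theta_true = np.array([0.5, -1.0, 2.0, 0.0])     # assumed for the illustration
y = G @ theta_true + 0.1 * rng.normal(size=50)

# Equation (4.50), written as a linear solve of the normal equations.
theta_normal = np.linalg.solve(G.T @ G, G.T @ y)

# The same minimiser from NumPy's least-squares solver.
theta_lstsq, *_ = np.linalg.lstsq(G, y, rcond=None)

# Gradient of the squared error in Equation (4.49), evaluated at the estimate.
gradient = -2.0 * G.T @ y + 2.0 * G.T @ G @ theta_normal
```

Forming $G^T G$ explicitly squares the condition number of the problem, so for poorly conditioned $G$ a solver such as `np.linalg.lstsq` is usually the safer choice.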



4.2.4 Minimum Mean Absolute Value of Error Estimation 

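The minimum mean absolute value of error (MAVE) estimate (Figure 4.9) is obtained through minimisation of a Bayesian risk function defined as

$$
\mathcal{R}_{MAVE}(\hat{\theta}|y) = \mathcal{E}[\,|\hat{\theta} - \theta|\, \,|\, y] = \int_{\Theta} |\hat{\theta} - \theta|\, f_{\Theta|Y}(\theta|y)\, d\theta \tag{4.51}
$$

In the following it is shown that the minimum mean absolute value estimate is the median of the parameter process. Equation (4.51) can be re-expressed as

$$
\mathcal{R}_{MAVE}(\hat{\theta}|y) = \int_{-\infty}^{\hat{\theta}} [\hat{\theta} - \theta]\, f_{\Theta|Y}(\theta|y)\, d\theta + \int_{\hat{\theta}}^{\infty} [\theta - \hat{\theta}]\, f_{\Theta|Y}(\theta|y)\, d\theta \tag{4.52}
$$

Taking the derivative of the risk function with respect to $\hat{\theta}$ yields

$$
\frac{\partial \mathcal{R}_{MAVE}(\hat{\theta}|y)}{\partial \hat{\theta}} = \int_{-\infty}^{\hat{\theta}} f_{\Theta|Y}(\theta|y)\, d\theta - \int_{\hat{\theta}}^{\infty} f_{\Theta|Y}(\theta|y)\, d\theta \tag{4.53}
$$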
Figure 4.9 Illustration of mean absolute value of error cost function. Note that the MAVE estimate coincides with the conditional median of the posterior function.

The minimum absolute value of error is obtained by setting Equation (4.53) to zero:

$$
\int_{-\infty}^{\hat{\theta}_{MAVE}} f_{\Theta|Y}(\theta|y)\, d\theta = \int_{\hat{\theta}_{MAVE}}^{\infty} f_{\Theta|Y}(\theta|y)\, d\theta \tag{4.54}
$$

From Equation (4.54) we note the MAVE estimate is the median of the posterior density.

4.2.5 Equivalence of the MAP, ML, MMSE and MAVE for 
Gaussian Processes With Uniform Distributed Parameters 

Examples 4.4 and 4.5 show that for a Gaussian-distributed process the LSE estimate and the ML estimate are identical. Furthermore, Equation (4.42), for the MAP estimate of a Gaussian-distributed parameter, shows that as the parameter variance increases, or equivalently as the parameter prior pdf tends to a uniform distribution, the MAP estimate tends to the ML and LSE estimates. In general, for any symmetric distribution centred around the maximum, the mode, the mean and the median are identical. Hence, for a process with a symmetric pdf, if the prior distribution of the parameter is uniform then the MAP, the ML, the MMSE and the MAVE parameter estimates are identical. Figure 4.10 illustrates a symmetric pdf, an asymmetric pdf, and the relative positions of various estimates.
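For an asymmetric posterior the three estimates separate. The sketch below computes the mode (MAP), the mean (MMSE, Equation (4.46)) and the median (MAVE, Equation (4.54)) on a grid. The gamma-shaped posterior $\theta^2 e^{-\theta}$ and the grid limits are assumptions chosen only because the exact mode and mean (2 and 3) are known.

```python
import numpy as np

theta = np.linspace(0.0, 20.0, 4001)
d_theta = theta[1] - theta[0]

# Hypothetical asymmetric posterior, proportional to theta^2 * exp(-theta).
posterior = theta ** 2 * np.exp(-theta)
posterior /= posterior.sum() * d_theta            # normalise on the grid

theta_map = theta[np.argmax(posterior)]           # mode of the posterior
theta_mmse = np.sum(theta * posterior) * d_theta  # mean, Equation (4.46)
cdf = np.cumsum(posterior) * d_theta
theta_mave = theta[np.searchsorted(cdf, 0.5)]     # median, Equation (4.54)
```

For this right-skewed pdf the ordering is mode < median < mean, so the MAP, MAVE and MMSE estimates are all different.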
Figure 4.10 Illustration of a symmetric and an asymmetric pdf and their respective mode, mean and median and the relations to MAP, MAVE and MMSE estimates.



4.2.6 The Influence of the Prior on Estimation Bias and Variance 

The use of a prior pdf introduces a bias in the estimate towards the range of 
parameter values with a relatively high prior pdf, and reduces the variance 
of the estimate. To illustrate the effects of the prior pdf on the bias and the 
variance of an estimate, we consider the following examples in which the 
bias and the variance of the ML and the MAP estimates of the mean of a 
process are compared. 

Example 4.6 Consider the ML estimation of a random scalar parameter $\theta$, observed in a zero-mean additive white Gaussian noise (AWGN) $n(m)$, and expressed as

$$
y(m) = \theta + n(m), \qquad m = 0, \ldots, N-1 \tag{4.55}
$$

It is assumed that, for each realisation of the parameter $\theta$, $N$ observation samples are available. Note that, since the noise is assumed to be a zero-mean process, this problem is equivalent to estimation of the mean of the process $y(m)$. The likelihood of an observation vector $y = [y(0), y(1), \ldots, y(N-1)]$ and a parameter value of $\theta$ is given by

$$
f_{Y|\Theta}(y|\theta) = \prod_{m=0}^{N-1} f_N(y(m) - \theta) = \frac{1}{(2\pi\sigma_n^2)^{N/2}} \exp\left( -\frac{1}{2\sigma_n^2} \sum_{m=0}^{N-1} [y(m) - \theta]^2 \right) \tag{4.56}
$$
From Equation (4.56) the log-likelihood function is given by

$$
\ln f_{Y|\Theta}(y|\theta) = -\frac{N}{2} \ln(2\pi\sigma_n^2) - \frac{1}{2\sigma_n^2} \sum_{m=0}^{N-1} [y(m) - \theta]^2 \tag{4.57}
$$



The ML estimate of $\theta$, obtained by setting the derivative of $\ln f_{Y|\Theta}(y|\theta)$ to zero, is given by

$$
\hat{\theta}_{ML} = \frac{1}{N} \sum_{m=0}^{N-1} y(m) = \bar{y} \tag{4.58}
$$

where $\bar{y}$ denotes the time average of $y(m)$. From Equation (4.58), we note that the ML solution is an unbiased estimate

$$
\mathcal{E}[\hat{\theta}_{ML}] = \mathcal{E}\left( \frac{1}{N} \sum_{m=0}^{N-1} [\theta + n(m)] \right) = \theta \tag{4.59}
$$



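and the variance of the ML estimate is given by

$$
\mathrm{Var}[\hat{\theta}_{ML}] = \mathcal{E}[(\hat{\theta}_{ML} - \theta)^2] = \mathcal{E}\left[ \left( \frac{1}{N} \sum_{m=0}^{N-1} y(m) - \theta \right)^2 \right] = \frac{\sigma_n^2}{N} \tag{4.60}
$$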



Note that the variance of the ML estimate decreases with increasing length 
of observation. 
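Equations (4.59) and (4.60) can be checked by simulation. In the sketch below the parameter value, noise variance, observation length and number of trials are assumed values; each row of `y` is one independent realisation of the N observations.

```python
import numpy as np

rng = np.random.default_rng(4)
theta, sigma_n2, N, trials = 2.0, 4.0, 25, 20000   # assumed values

# Each row is one observation record y(0), ..., y(N-1) of Equation (4.55).
y = theta + rng.normal(scale=np.sqrt(sigma_n2), size=(trials, N))
theta_ml = y.mean(axis=1)            # Equation (4.58), one estimate per record

empirical_mean = theta_ml.mean()     # expected to be close to theta, Equation (4.59)
empirical_var = theta_ml.var()       # expected to be close to sigma_n2 / N, Equation (4.60)
```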



Example 4.7 Estimation of a uniformly-distributed parameter observed in AWGN. Consider the effects of using a uniform parameter prior on the mean and the variance of the estimate in Example 4.6. Assume that the prior for the parameter $\theta$ is given by

$$
f_{\Theta}(\theta) = \begin{cases} 1/(\theta_{\max} - \theta_{\min}) & \theta_{\min} \le \theta \le \theta_{\max} \\ 0 & \text{otherwise} \end{cases} \tag{4.61}
$$



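as illustrated in Figure 4.11. From Bayes' rule, the posterior pdf is given by

$$
f_{\Theta|Y}(\theta|y) = \frac{1}{f_Y(y)} f_{Y|\Theta}(y|\theta)\, f_{\Theta}(\theta) = \begin{cases} \dfrac{1}{(\theta_{\max} - \theta_{\min})\, f_Y(y)\, (2\pi\sigma_n^2)^{N/2}} \exp\left( -\dfrac{1}{2\sigma_n^2} \displaystyle\sum_{m=0}^{N-1} [y(m) - \theta]^2 \right) & \theta_{\min} \le \theta \le \theta_{\max} \\ 0 & \text{otherwise} \end{cases} \tag{4.62}
$$

Figure 4.11 Illustration of the effects of a uniform prior.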

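The MAP estimate is obtained by maximising the posterior pdf:

$$
\hat{\theta}_{MAP}(y) = \begin{cases} \theta_{\min} & \text{if } \hat{\theta}_{ML}(y) < \theta_{\min} \\ \hat{\theta}_{ML}(y) & \text{if } \theta_{\min} \le \hat{\theta}_{ML}(y) \le \theta_{\max} \\ \theta_{\max} & \text{if } \hat{\theta}_{ML}(y) > \theta_{\max} \end{cases} \tag{4.63}
$$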



Note that the MAP estimate is constrained to the range $\theta_{\min}$ to $\theta_{\max}$. This constraint is desirable and moderates the estimates that, due to say low signal-to-noise ratio, fall outside the range of possible values of $\theta$. It is easy to see that the variance of an estimate constrained to a range of $\theta_{\min}$ to $\theta_{\max}$ is less than the variance of the ML estimate in which there is no constraint on the range of the parameter estimate:

$$
\mathrm{Var}[\hat{\theta}_{MAP}] = \int_{\theta_{\min}}^{\theta_{\max}} (\hat{\theta}_{MAP} - \theta)^2 f_{Y|\Theta}(y|\theta)\, dy \;\le\; \mathrm{Var}[\hat{\theta}_{ML}] = \int_{-\infty}^{\infty} (\hat{\theta}_{ML} - \theta)^2 f_{Y|\Theta}(y|\theta)\, dy \tag{4.64}
$$
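The effect of the constraint in Equation (4.63) can be illustrated by simulation. In the sketch below the prior range, the true parameter value, the noise variance and the number of trials are assumptions; the MAP estimate is obtained by clipping the ML estimate to the prior range, and the mean square errors about the true value are compared.

```python
import numpy as np

rng = np.random.default_rng(5)
theta_min, theta_max = 0.0, 1.0                    # assumed prior range
theta, sigma_n2, N, trials = 0.8, 1.0, 10, 20000   # assumed values

y = theta + rng.normal(scale=np.sqrt(sigma_n2), size=(trials, N))
theta_ml = y.mean(axis=1)                            # Equation (4.58)
theta_map = np.clip(theta_ml, theta_min, theta_max)  # Equation (4.63)

# Mean square error of each estimator about the true parameter value.
mse_ml = np.mean((theta_ml - theta) ** 2)
mse_map = np.mean((theta_map - theta) ** 2)
```

Because the true value lies inside the prior range, clipping can only move an out-of-range estimate closer to it, so the error of the constrained estimate never exceeds that of the ML estimate.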
Figure 4.12 Illustration of the posterior pdf as product of the likelihood and the prior.

Example 4.8 Estimation of a Gaussian-distributed parameter observed in AWGN. In this example, we consider the effect of a Gaussian prior on the mean and the variance of the MAP estimate. Assume that the parameter $\theta$ is Gaussian-distributed with a mean $\mu_\theta$ and a variance $\sigma_\theta^2$ as

$$
f_{\Theta}(\theta) = \frac{1}{(2\pi\sigma_\theta^2)^{1/2}} \exp\left( -\frac{(\theta - \mu_\theta)^2}{2\sigma_\theta^2} \right) \tag{4.65}
$$



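From Bayes' rule the posterior pdf is given as the product of the likelihood and the prior pdfs as:

$$
f_{\Theta|Y}(\theta|y) = \frac{1}{f_Y(y)} f_{Y|\Theta}(y|\theta)\, f_{\Theta}(\theta) = \frac{1}{f_Y(y)\, (2\pi\sigma_n^2)^{N/2} (2\pi\sigma_\theta^2)^{1/2}} \exp\left( -\frac{1}{2\sigma_n^2} \sum_{m=0}^{N-1} [y(m) - \theta]^2 - \frac{1}{2\sigma_\theta^2} (\theta - \mu_\theta)^2 \right) \tag{4.66}
$$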

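The maximum posterior solution is obtained by setting the derivative of the log-posterior function, $\ln f_{\Theta|Y}(\theta|y)$, with respect to $\theta$ to zero:

$$
\hat{\theta}_{MAP}(y) = \frac{\sigma_\theta^2}{\sigma_\theta^2 + \sigma_n^2/N}\, \bar{y} + \frac{\sigma_n^2/N}{\sigma_\theta^2 + \sigma_n^2/N}\, \mu_\theta \tag{4.67}
$$

where $\bar{y} = \sum_{m=0}^{N-1} y(m)/N$.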

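Note that the MAP estimate is an interpolation between the ML estimate $\bar{y}$ and the mean of the prior pdf $\mu_\theta$, as shown in Figure 4.12.

Figure 4.13 Illustration of the effect of increasing length of observation on the variance of an estimator.

The expectation of the MAP estimate is obtained by noting that the only random variable on the right-hand side of Equation (4.67) is the term $\bar{y}$, and that $\mathcal{E}[\bar{y}] = \theta$:

$$
\mathcal{E}[\hat{\theta}_{MAP}(y)] = \frac{\sigma_\theta^2}{\sigma_\theta^2 + \sigma_n^2/N}\, \theta + \frac{\sigma_n^2/N}{\sigma_\theta^2 + \sigma_n^2/N}\, \mu_\theta \tag{4.68}
$$

Since the MAP estimate is a linear function of $\bar{y}$, its variance is the variance of $\bar{y}$ scaled by the square of the weight on $\bar{y}$:

$$
\mathrm{Var}[\hat{\theta}_{MAP}(y)] = \left( \frac{\sigma_\theta^2}{\sigma_\theta^2 + \sigma_n^2/N} \right)^2 \times \mathrm{Var}[\bar{y}] = \frac{\sigma_n^2/N}{\left( 1 + \sigma_n^2/(N\sigma_\theta^2) \right)^2} \tag{4.69}
$$

Substitution of Equation (4.60) in Equation (4.69) yields

$$
\mathrm{Var}[\hat{\theta}_{MAP}(y)] = \frac{\mathrm{Var}[\hat{\theta}_{ML}(y)]}{\left( 1 + \mathrm{Var}[\hat{\theta}_{ML}(y)]/\sigma_\theta^2 \right)^2} \tag{4.70}
$$

Note that as $\sigma_\theta^2$, the variance of the parameter $\theta$, increases the influence of the prior decreases, and the variance of the MAP estimate tends towards the variance of the ML estimate.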
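The bias and variance of the MAP estimate of Equation (4.67) can be examined by simulation for a fixed parameter value. All numerical values below are assumptions. The expected variance is computed from the fact that the estimate is a linear function of the sample mean, so that Var[w ȳ + const] = w² Var[ȳ].

```python
import numpy as np

rng = np.random.default_rng(6)
theta, mu_theta, sigma_theta2 = 1.0, 0.0, 0.5   # assumed true value and prior
sigma_n2, N, trials = 2.0, 8, 40000             # assumed noise variance and sizes

w = sigma_theta2 / (sigma_theta2 + sigma_n2 / N)   # weight on the sample mean in (4.67)
y = theta + rng.normal(scale=np.sqrt(sigma_n2), size=(trials, N))
ybar = y.mean(axis=1)                              # ML estimate, Equation (4.58)
theta_map = w * ybar + (1.0 - w) * mu_theta        # Equation (4.67)

expected_mean = w * theta + (1.0 - w) * mu_theta   # Equation (4.68)
expected_var = w ** 2 * sigma_n2 / N               # w^2 times Var[ybar] = sigma_n2 / N
```

The simulation shows the trade-off described in this section: the MAP estimate is pulled towards the prior mean (a bias when the prior mean differs from the true value), while its variance is smaller than that of the ML estimate.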

4.2.7 The Relative Importance of the Prior and the Observation 

A fundamental issue in the Bayesian inference method is the relative 
influence of the observation signal and the prior pdf on the outcome. The 
importance of the observation depends on the confidence in the observation, 
and the confidence in turn depends on the length of the observation and on 
the signal-to-noise ratio (SNR). In general, as the number of observation samples and the SNR increase, the variance of the estimate and the influence of the prior decrease. From Equation (4.67) for the estimation of a Gaussian-distributed parameter observed in AWGN, as the length of the observation $N$ increases, the importance of the prior decreases, and the MAP estimate tends to the ML estimate:

$$
\lim_{N \to \infty} \hat{\theta}_{MAP}(y) = \lim_{N \to \infty} \left( \frac{\sigma_\theta^2}{\sigma_\theta^2 + \sigma_n^2/N}\, \bar{y} + \frac{\sigma_n^2/N}{\sigma_\theta^2 + \sigma_n^2/N}\, \mu_\theta \right) = \bar{y} = \hat{\theta}_{ML} \tag{4.71}
$$



j 



As illustrated in Figure 4.13, as the length of the observation $N$ tends to infinity, both the MAP and the ML estimates of the parameter should tend to its true value $\theta$.



Example 4.9 MAP estimation of a signal in additive noise. Consider the estimation of a scalar-valued Gaussian signal $x(m)$, observed in an additive Gaussian white noise $n(m)$, and modelled as

$$
y(m) = x(m) + n(m) \tag{4.72}
$$

The posterior pdf of the signal $x(m)$ is given by

$$
\begin{aligned}
f_{X|Y}(x(m)|y(m)) &= \frac{1}{f_Y(y(m))} f_{Y|X}(y(m)|x(m))\, f_X(x(m)) \\
&= \frac{1}{f_Y(y(m))} f_N(y(m) - x(m))\, f_X(x(m))
\end{aligned} \tag{4.73}
$$



where $f_X(x(m)) = \mathcal{N}(x(m), \mu_x, \sigma_x^2)$ and $f_N(n(m)) = \mathcal{N}(n(m), \mu_n, \sigma_n^2)$ are the Gaussian pdfs of the signal and noise respectively. Substitution of the signal and noise pdfs in Equation (4.73) yields

$$
f_{X|Y}(x(m)|y(m)) = \frac{1}{f_Y(y(m))} \frac{1}{\sqrt{2\pi}\,\sigma_n} \exp\left( -\frac{[y(m) - x(m) - \mu_n]^2}{2\sigma_n^2} \right) \times \frac{1}{\sqrt{2\pi}\,\sigma_x} \exp\left( -\frac{[x(m) - \mu_x]^2}{2\sigma_x^2} \right) \tag{4.74}
$$



This equation can be rewritten as

$$
f_{X|Y}(x(m)|y(m)) = \frac{1}{f_Y(y(m))\, 2\pi\sigma_n\sigma_x} \exp\left( -\frac{\sigma_x^2 [y(m) - x(m) - \mu_n]^2 + \sigma_n^2 [x(m) - \mu_x]^2}{2\sigma_x^2\sigma_n^2} \right) \tag{4.75}
$$



To obtain the MAP estimate we set the derivative of the log-posterior function $\ln f_{X|Y}(x(m)|y(m))$ with respect to $x(m)$ to zero as

$$
\frac{\partial [\ln f_{X|Y}(x(m)|y(m))]}{\partial x(m)} = -\frac{-2\sigma_x^2 (y(m) - x(m) - \mu_n) + 2\sigma_n^2 (x(m) - \mu_x)}{2\sigma_x^2\sigma_n^2} = 0 \tag{4.76}
$$

From Equation (4.76) the MAP signal estimate is given by

$$
\hat{x}(m) = \frac{\sigma_x^2}{\sigma_x^2 + \sigma_n^2} [y(m) - \mu_n] + \frac{\sigma_n^2}{\sigma_x^2 + \sigma_n^2}\, \mu_x \tag{4.77}
$$



Note that the estimate $\hat{x}(m)$ is a weighted linear interpolation between the unconditional mean of $x(m)$, $\mu_x$, and the observed value $(y(m) - \mu_n)$. At a very poor SNR, i.e. when $\sigma_x^2 \ll \sigma_n^2$, we have $\hat{x}(m) \approx \mu_x$; on the other hand, for a noise-free signal $\sigma_n^2 = 0$ and $\mu_n = 0$, and we have $\hat{x}(m) = y(m)$.
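Equation (4.77) and its two limiting cases can be written as a short function. The function name `map_signal_estimate` and the numerical values are assumptions for the illustration.

```python
def map_signal_estimate(y, mu_x, var_x, mu_n, var_n):
    """Equation (4.77): interpolate between (y - mu_n) and the prior mean mu_x."""
    w = var_x / (var_x + var_n)          # weight on the noise-compensated observation
    return w * (y - mu_n) + (1.0 - w) * mu_x

# Equal signal and noise variances: the estimate is midway between 3.0 and 1.0.
x_equal = map_signal_estimate(3.0, mu_x=1.0, var_x=1.0, mu_n=0.0, var_n=1.0)

# Very poor SNR: the estimate falls back to the prior mean mu_x.
x_poor_snr = map_signal_estimate(3.0, mu_x=1.0, var_x=1e-12, mu_n=0.0, var_n=1.0)

# Noise-free case: the estimate equals the observation.
x_noise_free = map_signal_estimate(3.0, mu_x=1.0, var_x=1.0, mu_n=0.0, var_n=0.0)
```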



Example 4.10 MAP estimate of a Gaussian-AR process observed in AWGN. Consider a vector of $N$ samples $x$ from an autoregressive (AR) process observed in an additive Gaussian noise, and modelled as

$$
y = x + n \tag{4.78}
$$

From Chapter 8, a vector $x$ from an AR process may be expressed as

$$
e = Ax \tag{4.79}
$$

where $A$ is a matrix of the AR model coefficients, and the vector $e$ is the input signal of the AR model. Assuming that the signal $x$ is Gaussian, and that the $P$ initial samples $x_0$ are known, the pdf of the signal $x$ is given by

$$
f_X(x|x_0) = f_E(e) = \frac{1}{(2\pi\sigma_e^2)^{N/2}} \exp\left( -\frac{1}{2\sigma_e^2} x^T A^T A x \right) \tag{4.80}
$$



where it is assumed that the input signal $e$ of the AR model is a zero-mean uncorrelated process with variance $\sigma_e^2$. The pdf of a zero-mean Gaussian noise vector $n$, with covariance matrix $\Sigma_{nn}$, is given by

$$
f_N(n) = \frac{1}{(2\pi)^{N/2} |\Sigma_{nn}|^{1/2}} \exp\left( -\frac{1}{2} n^T \Sigma_{nn}^{-1} n \right) \tag{4.81}
$$



From Bayes' rule, the pdf of the signal given the noisy observation is

$$
f_{X|Y}(x|y) = \frac{f_{Y|X}(y|x)\, f_X(x)}{f_Y(y)} = \frac{1}{f_Y(y)} f_N(y - x)\, f_X(x) \tag{4.82}
$$



Substitution of the pdfs of the signal and noise in Equation (4.82) yields

$$
f_{X|Y}(x|y) = \frac{1}{f_Y(y)\, (2\pi)^{N} \sigma_e^{N} |\Sigma_{nn}|^{1/2}} \exp\left( -\frac{1}{2} \left[ (y - x)^T \Sigma_{nn}^{-1} (y - x) + \frac{x^T A^T A x}{\sigma_e^2} \right] \right) \tag{4.83}
$$



The MAP estimate corresponds to the minimum of the argument of the exponential function in Equation (4.83). Assuming that the argument of the exponential function is differentiable, and has a well-defined minimum, we can obtain the MAP estimate from

$$
\hat{x}_{MAP}(y) = \arg\operatorname{zero}_{x} \left\{ \frac{\partial}{\partial x} \left[ (y - x)^T \Sigma_{nn}^{-1} (y - x) + \frac{x^T A^T A x}{\sigma_e^2} \right] \right\} \tag{4.84}
$$
The MAP estimate is

$$
\hat{x}_{MAP}(y) = \left( I + \frac{1}{\sigma_e^2} \Sigma_{nn} A^T A \right)^{-1} y \tag{4.85}
$$



where I is the identity matrix. 
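A small numerical sketch of Equation (4.85) follows. It assumes a first-order AR model, $e(m) = x(m) - a\,x(m-1)$ with a zero initial sample, white noise with covariance $\sigma_n^2 I$, and the particular values of $a$, $\sigma_e^2$ and $\sigma_n^2$ shown; none of these choices come from the text.

```python
import numpy as np

rng = np.random.default_rng(7)
N, a, sigma_e2, sigma_n2 = 200, 0.95, 1.0, 4.0   # assumed values

# AR(1) sketch: e(m) = x(m) - a x(m-1), so A has 1 on the diagonal, -a below it.
A = np.eye(N) - a * np.eye(N, k=-1)
e = rng.normal(scale=np.sqrt(sigma_e2), size=N)
x = np.linalg.solve(A, e)                              # synthesise x from e = A x, Equation (4.79)
y = x + rng.normal(scale=np.sqrt(sigma_n2), size=N)    # Equation (4.78)

cov_nn = sigma_n2 * np.eye(N)                          # white-noise covariance (assumed)
x_map = np.linalg.solve(np.eye(N) + cov_nn @ A.T @ A / sigma_e2, y)   # Equation (4.85)

mse_noisy = np.mean((y - x) ** 2)      # error of the raw observation
mse_map = np.mean((x_map - x) ** 2)    # error of the MAP estimate
```

Using a linear solve instead of forming the matrix inverse in Equation (4.85) gives the same estimate at lower cost and with better numerical behaviour.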
4.3 The Estimate-Maximise (EM) Method

The EM algorithm is an iterative likelihood maximisation method with 
applications in blind deconvolution, model-based signal interpolation, 
spectral estimation from noisy observations, estimation of a set of model 
parameters from a training data set, etc. The EM is a framework for solving 
problems where it is difficult to obtain a direct ML estimate either because 
the data is incomplete or because the problem is difficult. 

To define the term incomplete data, consider a signal $x$ from a random process $X$ with an unknown parameter vector $\theta$ and a pdf $f_{X;\Theta}(x;\theta)$. The notation $f_{X;\Theta}(x;\theta)$ expresses the dependence of the pdf of $X$ on the value of the unknown parameter $\theta$. The signal $x$ is the so-called complete data and the ML estimate of the parameter vector $\theta$ may be obtained from $f_{X;\Theta}(x;\theta)$. Now assume that the signal $x$ goes through a many-to-one non-invertible transformation (e.g. when a number of samples of the vector $x$ are lost) and is observed as $y$. The observation $y$ is the so-called incomplete data.

Maximisation of the likelihood of the incomplete data, $f_{Y;\Theta}(y;\theta)$, with respect to the parameter vector $\theta$ is often a difficult task, whereas maximisation of the likelihood of the complete data $f_{X;\Theta}(x;\theta)$ is relatively easy. Since the complete data is unavailable, the parameter estimate is obtained through maximisation of the conditional expectation of the log-likelihood of the complete data defined as

$$
\mathcal{E}[\ln f_{X;\Theta}(x;\theta) \,|\, y] = \int_{X} f_{X|Y;\Theta}(x|y;\theta) \ln f_{X;\Theta}(x;\theta)\, dx \tag{4.86}
$$

x 

In Equation (4.86), the computation of the term $f_{X|Y;\Theta}(x|y;\theta)$ requires an estimate of the unknown parameter vector $\theta$. For this reason, the expectation of the likelihood function is maximised iteratively starting with an initial estimate of $\theta$, and updating the estimate as described in the following.

Figure 4.14 Illustration of transformation of complete data to incomplete data.
EM Algorithm

Step 1: Initialisation Select an initial parameter estimate $\hat{\theta}_0$, and for $i = 0, 1, \ldots$ until convergence:

Step 2: Expectation Compute

$$
U(\theta, \hat{\theta}_i) = \mathcal{E}[\ln f_{X;\Theta}(x;\theta) \,|\, y; \hat{\theta}_i] = \int_{X} f_{X|Y;\Theta}(x|y;\hat{\theta}_i) \ln f_{X;\Theta}(x;\theta)\, dx \tag{4.87}
$$

Step 3: Maximisation Select

$$
\hat{\theta}_{i+1} = \arg\max_{\theta} U(\theta, \hat{\theta}_i) \tag{4.88}
$$

Step 4: Convergence test If not converged then go to Step 2.

4.3.1 Convergence of the EM Algorithm 

In this section, it is shown that the EM algorithm converges to a maximum of the likelihood of the incomplete data $f_{Y;\Theta}(y;\theta)$. The likelihood of the complete data can be written as

$$
f_{X,Y;\Theta}(x,y;\theta) = f_{X|Y;\Theta}(x|y;\theta)\, f_{Y;\Theta}(y;\theta) \tag{4.89}
$$

where $f_{X,Y;\Theta}(x,y;\theta)$ is the likelihood of $x$ and $y$ with $\theta$ as a parameter. From Equation (4.89), the log-likelihood of the incomplete data is obtained as

$$
\ln f_{Y;\Theta}(y;\theta) = \ln f_{X,Y;\Theta}(x,y;\theta) - \ln f_{X|Y;\Theta}(x|y;\theta) \tag{4.90}
$$



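Using an estimate $\hat{\theta}_i$ of the parameter vector $\theta$, and taking the expectation of Equation (4.90) over the space of the complete signal $x$, we obtain

$$
\ln f_{Y;\Theta}(y;\theta) = U(\theta;\hat{\theta}_i) - V(\theta;\hat{\theta}_i) \tag{4.91}
$$

where for a given $y$, the expectation of $\ln f_{Y;\Theta}(y;\theta)$ is itself, and the function $U(\theta;\hat{\theta}_i)$ is the conditional expectation of $\ln f_{X,Y;\Theta}(x,y;\theta)$:

$$
U(\theta;\hat{\theta}_i) = \mathcal{E}[\ln f_{X,Y;\Theta}(x,y;\theta) \,|\, y; \hat{\theta}_i] = \int_{X} f_{X|Y;\Theta}(x|y;\hat{\theta}_i) \ln f_{X,Y;\Theta}(x,y;\theta)\, dx \tag{4.92}
$$

The function $V(\theta;\hat{\theta}_i)$ is the conditional expectation of $\ln f_{X|Y;\Theta}(x|y;\theta)$:

$$
V(\theta;\hat{\theta}_i) = \mathcal{E}[\ln f_{X|Y;\Theta}(x|y;\theta) \,|\, y; \hat{\theta}_i] = \int_{X} f_{X|Y;\Theta}(x|y;\hat{\theta}_i) \ln f_{X|Y;\Theta}(x|y;\theta)\, dx \tag{4.93}
$$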



Now, from Equation (4.91), the log-likelihood of the incomplete data $y$ with parameter estimate $\hat{\theta}_i$ at iteration $i$ is

$$
\ln f_{Y;\Theta}(y;\hat{\theta}_i) = U(\hat{\theta}_i;\hat{\theta}_i) - V(\hat{\theta}_i;\hat{\theta}_i) \tag{4.94}
$$

It can be shown (see Dempster et al., 1977) that the function $V$ satisfies the inequality

$$
V(\hat{\theta}_{i+1};\hat{\theta}_i) \le V(\hat{\theta}_i;\hat{\theta}_i) \tag{4.95}
$$

and in the maximisation step of EM we choose $\hat{\theta}_{i+1}$ such that

$$
U(\hat{\theta}_{i+1};\hat{\theta}_i) \ge U(\hat{\theta}_i;\hat{\theta}_i) \tag{4.96}
$$

From Equation (4.94) and the inequalities (4.95) and (4.96), it follows that

$$
\ln f_{Y;\Theta}(y;\hat{\theta}_{i+1}) \ge \ln f_{Y;\Theta}(y;\hat{\theta}_i) \tag{4.97}
$$

Therefore at every iteration of the EM algorithm, the conditional likelihood of the estimate increases until the estimate converges to a local maximum of the log-likelihood function $\ln f_{Y;\Theta}(y;\theta)$.
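The EM steps and the monotonic behaviour of Equation (4.97) can be illustrated with a small experiment. The sketch below is an assumption-laden example rather than the formulation used later in the book: the incomplete data are samples from a mixture of two unit-variance Gaussians whose component labels have been discarded, the unknown parameters are the two means and the mixture weights, and the closed-form maximisation step is the standard one for this simple model.

```python
import numpy as np

rng = np.random.default_rng(8)
# Hypothetical incomplete data: the component labels (the missing part of the
# complete data) are discarded, leaving only the observed samples y.
y = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.0, 700)])

def component_pdfs(y, means):
    """Unit-variance Gaussian pdf of every sample under every component."""
    return np.exp(-0.5 * (y[:, None] - means[None, :]) ** 2) / np.sqrt(2.0 * np.pi)

def log_likelihood(y, weights, means):
    """Log-likelihood of the incomplete data for the current parameters."""
    return np.sum(np.log(component_pdfs(y, means) @ weights))

weights, means = np.array([0.5, 0.5]), np.array([-1.0, 1.0])   # Step 1: initial estimate
history = [log_likelihood(y, weights, means)]
for _ in range(50):
    # Step 2 (expectation): posterior probability of each label given y
    # and the current parameter estimate.
    joint = component_pdfs(y, means) * weights
    resp = joint / joint.sum(axis=1, keepdims=True)
    # Step 3 (maximisation): closed-form maximiser of U for this model.
    weights = resp.mean(axis=0)
    means = (resp * y[:, None]).sum(axis=0) / resp.sum(axis=0)
    history.append(log_likelihood(y, weights, means))
```

The list `history` holds the incomplete-data log-likelihood after each iteration; in line with Equation (4.97) it should never decrease. A fixed number of iterations is used here in place of an explicit convergence test.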

The EM algorithm is applied to the solution of a number of problems in this book. In Section 4.5 of this chapter, the estimation of the parameters of a mixture Gaussian model for the signal space of a recorded process is formulated in an EM framework. In Chapter 5, the EM is used for estimation of the parameters of a hidden Markov model.
4.4 Cramer-Rao Bound on the Minimum Estimator Variance

An important measure of the performance of an estimator is the variance of the estimate with the varying values of the observation signal $y$ and the parameter vector $\theta$. The minimum estimation variance depends on the distributions of the parameter vector $\theta$ and on the observation signal $y$. In this section, we first consider the lower bound on the variance of the estimates of a constant parameter, and then extend the results to random parameters.

The Cramer-Rao lower bound on the variance of the estimate of the $i$th coefficient $\theta_i$ of a parameter vector $\theta$ is given as

$$
\mathrm{Var}[\hat{\theta}_i(y)] \ge \frac{\left( 1 + \dfrac{\partial \theta_{Bias}}{\partial \theta_i} \right)^2}{\mathcal{E}\left[ \left( \dfrac{\partial \ln f_{Y|\Theta}(y|\theta)}{\partial \theta_i} \right)^2 \right]} \tag{4.98}
$$



An estimator that achieves the lower bound on the variance is called the 
minimum variance, or the most efficient, estimator. 
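As a numerical illustration of the bound, the sketch below uses the AWGN model of Example 4.6, for which the ML estimate (the sample mean) is unbiased, so the numerator of the bound is 1. The parameter value, noise variance, observation length and number of trials are assumed values; the expectation in the denominator is approximated by a Monte Carlo average.

```python
import numpy as np

rng = np.random.default_rng(9)
theta, sigma_n2, N, trials = 1.0, 2.0, 20, 50000   # assumed values

y = theta + rng.normal(scale=np.sqrt(sigma_n2), size=(trials, N))

# Derivative of the log-likelihood (4.57) with respect to theta, per record.
score = np.sum(y - theta, axis=1) / sigma_n2
fisher_information = np.mean(score ** 2)   # estimate of E[(d ln f / d theta)^2], about N / sigma_n2
crlb = 1.0 / fisher_information            # bound (4.98) for an unbiased estimator

var_sample_mean = y.mean(axis=1).var()     # variance of the ML estimate, about sigma_n2 / N
```

In this example the variance of the sample mean matches the bound to within simulation error, which is consistent with the sample mean being an efficient estimator for this model.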



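Proof The bias in the estimate $\hat{\theta}_i(y)$ of the $i$th coefficient of the parameter vector $\theta$, averaged over the observation space $Y$, is defined as

$$
\mathcal{E}[\hat{\theta}_i(y) - \theta_i] = \int_{-\infty}^{\infty} [\hat{\theta}_i(y) - \theta_i]\, f_{Y|\Theta}(y|\theta)\, dy = \theta_{Bias} \tag{4.99}
$$

Differentiation of Equation (4.99) with respect to $\theta_i$ yields

$$
\int_{-\infty}^{\infty} \left\{ [\hat{\theta}_i(y) - \theta_i] \frac{\partial f_{Y|\Theta}(y|\theta)}{\partial \theta_i} - f_{Y|\Theta}(y|\theta) \right\} dy = \frac{\partial \theta_{Bias}}{\partial \theta_i} \tag{4.100}
$$

For a probability density function we have

$$
\int_{-\infty}^{\infty} f_{Y|\Theta}(y|\theta)\, dy = 1 \tag{4.101}
$$

Therefore Equation (4.100) can be written as

$$
\int_{-\infty}^{\infty} [\hat{\theta}_i(y) - \theta_i] \frac{\partial f_{Y|\Theta}(y|\theta)}{\partial \theta_i}\, dy = 1 + \frac{\partial \theta_{Bias}}{\partial \theta_i} \tag{4.102}
$$

Now, since the derivative of the integral of a pdf is zero, taking the derivative of Equation (4.101) and multiplying the result by $\theta_{Bias}$ yields

$$
\theta_{Bias} \int_{-\infty}^{\infty} \frac{\partial f_{Y|\Theta}(y|\theta)}{\partial \theta_i}\, dy = 0 \tag{4.103}
$$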



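Substituting $\partial f_{Y|\Theta}(y|\theta)/\partial \theta_i = f_{Y|\Theta}(y|\theta)\, \partial \ln f_{Y|\Theta}(y|\theta)/\partial \theta_i$ into Equation (4.102), and using Equation (4.103), we obtain

$$
\int_{-\infty}^{\infty} [\hat{\theta}_i(y) - \theta_{Bias} - \theta_i] \frac{\partial \ln f_{Y|\Theta}(y|\theta)}{\partial \theta_i} f_{Y|\Theta}(y|\theta)\, dy = 1 + \frac{\partial \theta_{Bias}}{\partial \theta_i} \tag{4.104}
$$

Now squaring both sides of Equation (4.104), we obtain

$$
\left( \int_{-\infty}^{\infty} [\hat{\theta}_i(y) - \theta_{Bias} - \theta_i] \frac{\partial \ln f_{Y|\Theta}(y|\theta)}{\partial \theta_i} f_{Y|\Theta}(y|\theta)\, dy \right)^2 = \left( 1 + \frac{\partial \theta_{Bias}}{\partial \theta_i} \right)^2 \tag{4.105}
$$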



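For the left-hand side of Equation (4.105), application of the following Schwarz inequality

$$
\left( \int_{-\infty}^{\infty} f(y)\, g(y)\, dy \right)^2 \le \int_{-\infty}^{\infty} (f(y))^2\, dy \times \int_{-\infty}^{\infty} (g(y))^2\, dy \tag{4.106}
$$

yields

$$
\left( \int_{-\infty}^{\infty} \left( [\hat{\theta}_i(y) - \theta_{Bias} - \theta_i]\, f_{Y|\Theta}^{1/2}(y|\theta) \right) \left( \frac{\partial \ln f_{Y|\Theta}(y|\theta)}{\partial \theta_i} f_{Y|\Theta}^{1/2}(y|\theta) \right) dy \right)^2 \le \int_{-\infty}^{\infty} [\hat{\theta}_i(y) - \theta_{Bias} - \theta_i]^2 f_{Y|\Theta}(y|\theta)\, dy \times \int_{-\infty}^{\infty} \left( \frac{\partial \ln f_{Y|\Theta}(y|\theta)}{\partial \theta_i} \right)^2 f_{Y|\Theta}(y|\theta)\, dy \tag{4.107}
$$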



From Equations (4.105) and (4.107), we have 



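$$\mathrm{Var}[\hat{\theta}_i(y)] \times \mathcal{E}\left[ \left( \frac{\partial \ln f_{Y|\Theta}(y|\theta)}{\partial \theta_i} \right)^2 \right] \ge \left( 1 + \frac{\partial \theta_{Bias}}{\partial \theta_i} \right)^2 \qquad (4.108)$$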



The Cramer-Rao inequality (4.98) results directly from the inequality 
(4.108). 
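
As a numerical illustration of the bound (a sketch under assumed values, not part of the derivation above), consider N independent Gaussian samples with unknown mean and known variance. In that case the expected squared score is N/sigma^2, so the bound for an unbiased estimator is sigma^2/N, and the sample mean attains it. The function names and the parameter values below are hypothetical choices for the example.

```python
import random
import statistics

def crb_gaussian_mean(sigma, n):
    """Cramer-Rao bound for an unbiased estimate of a Gaussian mean
    from n independent samples with known standard deviation sigma."""
    return sigma ** 2 / n

def sample_mean_variance(theta, sigma, n, trials, seed=0):
    """Monte Carlo estimate of the variance of the sample-mean estimator."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(trials):
        y = [rng.gauss(theta, sigma) for _ in range(n)]
        estimates.append(sum(y) / n)
    return statistics.pvariance(estimates)

# Assumed illustrative values: theta = 1, sigma = 2, N = 25.
bound = crb_gaussian_mean(sigma=2.0, n=25)
empirical = sample_mean_variance(theta=1.0, sigma=2.0, n=25, trials=4000)
print(bound, empirical)
```

The empirical variance should lie close to the bound, which is consistent with the sample mean being an efficient estimator in this particular case.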



4.4.1 Cramer-Rao Bound for Random Parameters 



For random parameters the Cramer-Rao bound may be obtained using the 
same procedure as above, with the difference that in Equation (4.98) instead 
of the likelihood $f_{Y|\Theta}(y|\theta)$ we use the joint pdf $f_{Y,\Theta}(y,\theta)$, and we also use the 
logarithmic relation 

$$\frac{\partial \ln f_{Y,\Theta}(y,\theta)}{\partial \theta_i} = \frac{\partial \ln f_{Y|\Theta}(y|\theta)}{\partial \theta_i} + \frac{\partial \ln f_{\Theta}(\theta)}{\partial \theta_i} \qquad (4.109)$$



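The Cramer-Rao bound for random parameters is obtained as 

$$\mathrm{Var}[\hat{\theta}_i(y)] \ge \frac{\left( 1 + \dfrac{\partial \theta_{Bias}}{\partial \theta_i} \right)^2}{\mathcal{E}\left[ \left( \dfrac{\partial \ln f_{Y|\Theta}(y|\theta)}{\partial \theta_i} \right)^2 + \left( \dfrac{\partial \ln f_{\Theta}(\theta)}{\partial \theta_i} \right)^2 \right]} \qquad (4.110)$$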



where the second term in the denominator of Equation (4.110) describes the 
effect of the prior pdf of $\theta$. As expected, the use of the prior, $f_{\Theta}(\theta)$, can result 
in a decrease in the variance of the estimate. An alternative form of the 
minimum bound on estimation variance can be obtained by using the 
likelihood relation 

$$\mathcal{E}\left[ \left( \frac{\partial \ln f_{Y,\Theta}(y,\theta)}{\partial \theta_i} \right)^2 \right] = -\mathcal{E}\left[ \frac{\partial^2 \ln f_{Y,\Theta}(y,\theta)}{\partial \theta_i^2} \right] \qquad (4.111)$$

as 

$$\mathrm{Var}[\hat{\theta}_i(y)] \ge -\frac{\left( 1 + \dfrac{\partial \theta_{Bias}}{\partial \theta_i} \right)^2}{\mathcal{E}\left[ \dfrac{\partial^2 \ln f_{Y|\Theta}(y|\theta)}{\partial \theta_i^2} + \dfrac{\partial^2 \ln f_{\Theta}(\theta)}{\partial \theta_i^2} \right]} \qquad (4.112)$$



4.4.2 Cramer-Rao Bound for a Vector Parameter 

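For real-valued $P$-dimensional vector parameters, the Cramer-Rao bound 
for the covariance matrix of an unbiased estimator of $\theta$ is given by 

$$\mathrm{Cov}[\hat{\theta}] \ge J^{-1}(\theta) \qquad (4.113)$$

where $J$ is the $P \times P$ Fisher information matrix, with elements given by 

$$[J(\theta)]_{ij} = -\mathcal{E}\left[ \frac{\partial^2 \ln f_{Y,\Theta}(y,\theta)}{\partial \theta_i\, \partial \theta_j} \right] \qquad (4.114)$$

The lower bound on the variance of the $i$th element of the vector $\theta$ is given 
by 

$$\mathrm{Var}(\hat{\theta}_i) \ge [J^{-1}(\theta)]_{ii} \ge \frac{1}{[J(\theta)]_{ii}} = \frac{1}{-\mathcal{E}\left[ \dfrac{\partial^2 \ln f_{Y,\Theta}(y,\theta)}{\partial \theta_i^2} \right]} \qquad (4.115)$$

where $[J^{-1}(\theta)]_{ii}$ is the $i$th diagonal element of the inverse of the Fisher 
matrix. The second inequality holds with equality when the Fisher matrix is 
diagonal. 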
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Figure 4.15 Illustration of probabilistic modelling of a two-dimensional signal 
space with a mixture of five bivariate Gaussian densities. 



4.5 Design of Mixture Gaussian Models 

A practical method for the modelling of the probability density function of 
an arbitrary signal space is to fit (or “tile”) the space with a mixture of a 
number of Gaussian probability density functions. Figure 4.15 illustrates the 
modelling of a two-dimensional signal space with a number of circular and 
elliptically shaped Gaussian processes. Note that the Gaussian densities can 
be overlapping, with the result that in an area of overlap, a data point can be 
associated with different probabilities to different components of the 
Gaussian mixture. 

A main advantage of the use of a mixture Gaussian model is that it 
results in mathematically tractable signal processing solutions. A mixture 
Gaussian pdf model for a process $X$ is defined as 

$$f_X(x) = \sum_{k=1}^{K} P_k\, \mathcal{N}_k(x; \mu_k, \Sigma_k) \qquad (4.116)$$

where $\mathcal{N}_k(x; \mu_k, \Sigma_k)$ denotes the $k$th component of the mixture Gaussian 
pdf, with mean vector $\mu_k$ and covariance matrix $\Sigma_k$. The parameter $P_k$ is the 
prior probability of the $k$th mixture, and it can be interpreted as the expected 
fraction of the number of vectors from the process $X$ associated with the $k$th 
mixture. 

In general, there are an infinite number of different $K$-mixture Gaussian 
densities that can be used to “tile up” a signal space. Hence the modelling of 
a signal space with a $K$-mixture pdf space can be regarded as a many-to-one 
mapping, and the expectation-maximisation (EM) method can be applied for 
the estimation of the parameters of the Gaussian pdf models. 
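
The mixture definition of Equation (4.116) can be sketched for the scalar case as follows. This is a minimal illustration rather than a general implementation: the three components, their priors, means and variances are assumed values, and the Riemann sum only checks that the weighted sum of Gaussians still integrates to approximately one.

```python
import math

def gaussian_pdf(x, mean, var):
    """Scalar Gaussian density N(x; mean, var)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def mixture_pdf(x, priors, means, variances):
    """Mixture Gaussian pdf: sum over k of P_k * N_k(x; mu_k, var_k)."""
    return sum(p * gaussian_pdf(x, m, v)
               for p, m, v in zip(priors, means, variances))

# Assumed example parameters; the priors P_k must sum to one.
priors = [0.3, 0.5, 0.2]
means = [-2.0, 0.0, 3.0]
variances = [0.5, 1.0, 2.0]

# Riemann-sum check over [-20, 20] that the mixture is a valid pdf.
step = 0.01
area = sum(mixture_pdf(-20.0 + i * step, priors, means, variances) * step
           for i in range(4001))
print(area)
```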



4.5.1 The EM Algorithm for Estimation of Mixture Gaussian 
Densities 

The EM algorithm, discussed in Section 4.3, is an iterative maximum-likelihood 
(ML) estimation method, and can be employed to calculate the 
parameters of a $K$-mixture Gaussian pdf model for a given data set. To 
apply the EM method we first need to define the so-called complete and 
incomplete data sets. As usual the observation vectors $[y(m),\ m=0,\ldots,N-1]$ 
form the incomplete data. The complete data may be viewed as the 
observation vectors with a label attached to each vector $y(m)$ to indicate the 
component of the mixture Gaussian model that generated the vector. Note 
that if each signal vector $y(m)$ had a mixture component label attached, then 
the computation of the mean vector and the covariance matrix of each 
component of the mixture would be a relatively simple exercise. Therefore 
the complete and incomplete data can be defined as follows: 

The incomplete data: $y(m),\quad m=0,\ldots,N-1$ 

The complete data: $x(m)=[y(m),k]=y_k(m),\quad m=0,\ldots,N-1,\quad k \in (1,\ldots,K)$ 



The probability of the complete data is the probability that an observation 
vector $y(m)$ has a label $k$ associating it with the $k$th component of the mixture 
density. The main step in application of the EM method is to define the 
expectation of the complete data, given the observations and a current 
estimate of the parameter vector, as 

$$U(\Theta, \hat{\Theta}_i) = \mathcal{E}[\ln f_{Y,K;\Theta}(y(m),k;\Theta) \mid y(m); \hat{\Theta}_i] = \sum_{m=0}^{N-1} \sum_{k=1}^{K} \frac{f_{Y,K|\Theta}(y(m),k|\hat{\Theta}_i)}{f_{Y|\Theta}(y(m)|\hat{\Theta}_i)} \ln f_{Y,K;\Theta}(y(m),k;\Theta) \qquad (4.117)$$

where $\Theta = \{\theta_k = [P_k, \mu_k, \Sigma_k],\ k=1,\ldots,K\}$ are the parameters of the Gaussian 
mixture as in Equation (4.116). Now the joint pdf of $y(m)$ and the $k$th 
Gaussian component of the mixture density can be written as 

$$f_{Y,K|\Theta}(y(m),k|\hat{\Theta}_i) = \hat{P}_{k_i}\, \mathcal{N}_k(y(m); \hat{\mu}_{k_i}, \hat{\Sigma}_{k_i}) \qquad (4.118)$$

where $\mathcal{N}_k(y(m); \mu_k, \Sigma_k)$ is a Gaussian density with mean vector $\mu_k$ and 
covariance matrix $\Sigma_k$: 

$$\mathcal{N}_k(y(m); \mu_k, \Sigma_k) = \frac{1}{(2\pi)^{P/2} |\Sigma_k|^{1/2}} \exp\left( -\frac{1}{2} (y(m) - \mu_k)^{T} \Sigma_k^{-1} (y(m) - \mu_k) \right) \qquad (4.119)$$

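The pdf of $y(m)$ as a mixture of $K$ Gaussian densities is given by 

$$f_{Y|\Theta}(y(m)|\hat{\Theta}_i) = \mathcal{N}(y(m)|\hat{\Theta}_i) = \sum_{k=1}^{K} \hat{P}_{k_i}\, \mathcal{N}_k(y(m); \hat{\mu}_{k_i}, \hat{\Sigma}_{k_i}) \qquad (4.120)$$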



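Substitution of the Gaussian densities of Equation (4.118) and Equation 
(4.120) in Equation (4.117) yields 

$$U[(\mu, \Sigma, P), (\hat{\mu}_i, \hat{\Sigma}_i, \hat{P}_i)] = \sum_{m=0}^{N-1} \sum_{k=1}^{K} \frac{\hat{P}_{k_i}\, \mathcal{N}_k(y(m); \hat{\mu}_{k_i}, \hat{\Sigma}_{k_i})}{\mathcal{N}(y(m)|\hat{\Theta}_i)} \ln[P_k\, \mathcal{N}_k(y(m); \mu_k, \Sigma_k)]$$

$$= \sum_{m=0}^{N-1} \sum_{k=1}^{K} \left\{ \frac{\hat{P}_{k_i}\, \mathcal{N}_k(y(m); \hat{\mu}_{k_i}, \hat{\Sigma}_{k_i})}{\mathcal{N}(y(m)|\hat{\Theta}_i)} \ln P_k + \frac{\hat{P}_{k_i}\, \mathcal{N}_k(y(m); \hat{\mu}_{k_i}, \hat{\Sigma}_{k_i})}{\mathcal{N}(y(m)|\hat{\Theta}_i)} \ln \mathcal{N}_k(y(m); \mu_k, \Sigma_k) \right\} \qquad (4.121)$$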



Equation (4.121) is maximised with respect to the parameter $P_k$ using the 
constrained optimisation method. This involves adding the constraint term 
$\lambda(\sum_{k} P_k - 1)$, with Lagrange multiplier $\lambda$, to the right-hand side of 
Equation (4.121) and then setting the derivative with respect to $P_k$ to zero; 
this yields 

$$\hat{P}_{k_{i+1}} = \arg\max_{P_k} U[(\mu, \Sigma, P), (\hat{\mu}_i, \hat{\Sigma}_i, \hat{P}_i)] = \frac{1}{N} \sum_{m=0}^{N-1} \frac{\hat{P}_{k_i}\, \mathcal{N}_k(y(m); \hat{\mu}_{k_i}, \hat{\Sigma}_{k_i})}{\mathcal{N}(y(m)|\hat{\Theta}_i)} \qquad (4.122)$$



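The parameters $\mu_k$ and $\Sigma_k$ that maximise the function $U$ are obtained by 
setting the derivative of the function with respect to these parameters to 
zero: 

$$\hat{\mu}_{k_{i+1}} = \arg\max_{\mu_k} U[(\mu, \Sigma, P), (\hat{\mu}_i, \hat{\Sigma}_i, \hat{P}_i)] = \frac{\displaystyle\sum_{m=0}^{N-1} \frac{\hat{P}_{k_i}\, \mathcal{N}_k(y(m); \hat{\mu}_{k_i}, \hat{\Sigma}_{k_i})}{\mathcal{N}(y(m)|\hat{\Theta}_i)}\, y(m)}{\displaystyle\sum_{m=0}^{N-1} \frac{\hat{P}_{k_i}\, \mathcal{N}_k(y(m); \hat{\mu}_{k_i}, \hat{\Sigma}_{k_i})}{\mathcal{N}(y(m)|\hat{\Theta}_i)}} \qquad (4.123)$$

and 

$$\hat{\Sigma}_{k_{i+1}} = \arg\max_{\Sigma_k} U[(\mu, \Sigma, P), (\hat{\mu}_i, \hat{\Sigma}_i, \hat{P}_i)] = \frac{\displaystyle\sum_{m=0}^{N-1} \frac{\hat{P}_{k_i}\, \mathcal{N}_k(y(m); \hat{\mu}_{k_i}, \hat{\Sigma}_{k_i})}{\mathcal{N}(y(m)|\hat{\Theta}_i)}\, (y(m) - \hat{\mu}_{k_{i+1}})(y(m) - \hat{\mu}_{k_{i+1}})^{T}}{\displaystyle\sum_{m=0}^{N-1} \frac{\hat{P}_{k_i}\, \mathcal{N}_k(y(m); \hat{\mu}_{k_i}, \hat{\Sigma}_{k_i})}{\mathcal{N}(y(m)|\hat{\Theta}_i)}} \qquad (4.124)$$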

Equations (4.122)-(4.124) are the estimates of the parameters of a mixture 
Gaussian pdf model. These equations can be used in further iterations of the 
EM method until the parameter estimates converge. 
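
The re-estimation formulas can be sketched for scalar observations as below. This is a minimal illustration under stated assumptions, not a production implementation: the data are synthetic, the initial parameters and the fixed count of 50 iterations are arbitrary choices, the variance update is computed around the newly updated mean, and a small variance floor (an addition for numerical safety) prevents a component from collapsing.

```python
import math
import random

def gaussian_pdf(x, mean, var):
    """Scalar Gaussian density N(x; mean, var)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def em_step(data, priors, means, variances):
    """One EM iteration for a scalar K-mixture Gaussian model."""
    K, N = len(priors), len(data)
    # E-step: posterior probability of each component given each sample,
    # i.e. P_k N_k(y(m)) / sum over k of P_k N_k(y(m)).
    post = []
    for y in data:
        joint = [priors[k] * gaussian_pdf(y, means[k], variances[k]) for k in range(K)]
        total = sum(joint)
        post.append([j / total for j in joint])
    # M-step: posterior-weighted averages give the new prior, mean and variance.
    new_priors, new_means, new_vars = [], [], []
    for k in range(K):
        weight = sum(post[m][k] for m in range(N))
        mean_k = sum(post[m][k] * data[m] for m in range(N)) / weight
        var_k = sum(post[m][k] * (data[m] - mean_k) ** 2 for m in range(N)) / weight
        new_priors.append(weight / N)
        new_means.append(mean_k)
        new_vars.append(max(var_k, 1e-6))  # variance floor (assumed safeguard)
    return new_priors, new_means, new_vars

# Synthetic data from two assumed components: 30% near -3, 70% near +3.
rng = random.Random(1)
data = ([rng.gauss(-3.0, 1.0) for _ in range(300)]
        + [rng.gauss(3.0, 1.0) for _ in range(700)])

priors, means, variances = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]
for _ in range(50):
    priors, means, variances = em_step(data, priors, means, variances)
print(priors, means, variances)
```

In practice the loop would normally be stopped by monitoring the change in the log-likelihood rather than by a fixed iteration count.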



4.6 Bayesian Classification 

Classification is the processing and labelling of an observation sequence 
$\{y(m)\}$ with one of $M$ classes of signals $\{C_k;\ k=1,\ldots,M\}$ that could have 
generated the observation. Classifiers are present in all modern digital 
communication systems and in applications such as the decoding of 
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Figure 4.16 - Illustration of the overlap of the distribution of two classes of signals. 



discrete-valued symbols in digital communication receivers, speech 
compression, video compression, speech recognition, image recognition, 
character recognition, signal/noise classification and detectors. For example, 
in an M-symbol digital communication system, the channel output signal is 
classified as one of the M signalling symbols; in speech recognition, 
segments of speech signals are labelled with one of about 40 elementary 
phonemes; and in speech or video compression, a segment of speech 
samples or a block of image pixels is quantised and labelled with one of a 
number of prototype signal vectors in a codebook. In the design of a 
classifier, the aim is to reduce the classification error given the constraints 
on the signal-to-noise ratio, the bandwidth and the computational resources. 

Classification errors are due to overlap of the distributions of different 
classes of signals. This is illustrated in Figure 4.16 for a binary classification 
problem with two Gaussian distributed signal classes $C_1$ and $C_2$. In the 
shaded region, where the signal distributions overlap, a sample $x$ could 
belong to either of the two classes. The shaded area gives a measure of the 
classification error. The obvious solution suggested by Figure 4.16 for 
reducing the classification error is to reduce the overlap of the distributions. 
The overlap can be reduced in two ways: (a) by increasing the distance 
between the mean values of different classes, and (b) by reducing the 
variance of each class. In telecommunication systems the overlap between 
the signal classes is reduced using a combination of several methods 
including increasing the signal-to-noise ratio, increasing the distance 
between signal patterns by adding redundant error control coding bits, and 
signal shaping and post-filtering operations. In pattern recognition, where it 
is not possible to control the signal generation process (as in speech and 
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image recognition), the choice of the pattern features and models affects the 
classification error. The design of an efficient classifier for pattern 
recognition depends on a number of factors, which can be listed as follows: 

(1) Extraction and transformation of a set of discriminative features from 
the signal that can aid the classification process. The features need to 
adequately characterise each class and emphasise the difference 
between various classes. 

(2) Statistical modelling of the observation features for each class. For 
Bayesian classification, a posterior probability model for each class 
should be obtained. 

(3) Labelling of an unlabelled signal with one of the $M$ classes. 



4.6.1 Binary Classification 

The simplest form of classification is the labelling of an observation with 
one of two classes of signals. Figures 4.17(a) and 4.17(b) illustrate two 
examples of a simple binary classification problem in a two-dimensional 
signal space. In each case, the observation is the result of a random mapping 
(e.g. signal plus noise) from the binary source to the continuous observation 
space. In Figure 4.17(a), the binary sources and the observation space 
associated with each source are well separated, and it is possible to make an 
error-free classification of each observation. In Figure 4.17(b) there is less 
distance between the mean of the sources, and the observation signals have a 
greater spread. This results in some overlap of the signal spaces and 
classification error can occur. In binary classification, a signal x is labelled 
with the class that scores the higher posterior probability: 

$$P_{C|X}(C_1|x) \underset{C_2}{\overset{C_1}{\gtrless}} P_{C|X}(C_2|x) \qquad (4.125)$$

Using Bayes’ rule, Equation (4.125) can be rewritten as 

$$P_C(C_1)\, f_{X|C}(x|C_1) \underset{C_2}{\overset{C_1}{\gtrless}} P_C(C_2)\, f_{X|C}(x|C_2) \qquad (4.126)$$

Letting $P_C(C_1)=P_1$ and $P_C(C_2)=P_2$, Equation (4.126) is often written in 
terms of a likelihood ratio test as 

$$\frac{f_{X|C}(x|C_1)}{f_{X|C}(x|C_2)} \underset{C_2}{\overset{C_1}{\gtrless}} \frac{P_2}{P_1} \qquad (4.127)$$

Figure 4.17 Illustration of binary classification: (a) the source and observation spaces 
are well separated, (b) the observation spaces overlap. 



Taking the logarithm of the likelihood ratio yields the following discriminant function: 

$$h(x) = \ln f_{X|C}(x|C_1) - \ln f_{X|C}(x|C_2) \underset{C_2}{\overset{C_1}{\gtrless}} \ln \frac{P_2}{P_1} \qquad (4.128)$$



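Now assume that the signal in each class has a Gaussian distribution with a 
probability density function given by 

$$f_{X|C}(x|C_i) = \frac{1}{(2\pi)^{P/2} |\Sigma_i|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu_i)^{T} \Sigma_i^{-1} (x - \mu_i) \right), \quad i=1,2 \qquad (4.129)$$

From Equations (4.128) and (4.129), the discriminant function $h(x)$ 
becomes 

$$h(x) = -\frac{1}{2} (x - \mu_1)^{T} \Sigma_1^{-1} (x - \mu_1) + \frac{1}{2} (x - \mu_2)^{T} \Sigma_2^{-1} (x - \mu_2) + \frac{1}{2} \ln \frac{|\Sigma_2|}{|\Sigma_1|} \underset{C_2}{\overset{C_1}{\gtrless}} \ln \frac{P_2}{P_1} \qquad (4.130)$$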



Example 4.10 For two Gaussian-distributed classes of scalar-valued 
signals with distributions given by $\mathcal{N}(x(m), \mu_1, \sigma^2)$ and $\mathcal{N}(x(m), \mu_2, \sigma^2)$, 
and equal class probability $P_1=P_2=0.5$, the discrimination function of 
Equation (4.130) becomes 

$$h(x(m)) = \frac{\mu_1 - \mu_2}{\sigma^2}\, x(m) + \frac{1}{2}\, \frac{\mu_2^2 - \mu_1^2}{\sigma^2} \underset{C_2}{\overset{C_1}{\gtrless}} 0 \qquad (4.131)$$

Hence, assuming $\mu_1 < \mu_2$, the rule for signal classification becomes 

$$x(m) \underset{C_2}{\overset{C_1}{\lessgtr}} \frac{\mu_1 + \mu_2}{2} \qquad (4.132)$$

The signal is labelled with class $C_1$ if $x(m) < (\mu_1 + \mu_2)/2$ and as class $C_2$ 
otherwise. 
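
The log-likelihood-ratio test for two scalar Gaussian classes can be sketched as follows. This is an illustrative example only: the function names and the class parameters (means 0 and 4, unit variances, equal priors) are assumed, and with those values the decision threshold reduces to the midpoint of the two means.

```python
import math

def log_gaussian(x, mean, var):
    """Natural log of the scalar Gaussian density N(x; mean, var)."""
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (x - mean) ** 2 / var

def classify(x, mu1, mu2, var1, var2, p1=0.5, p2=0.5):
    """Return 1 or 2 by comparing h(x) = ln f(x|C1) - ln f(x|C2) with ln(P2/P1)."""
    h = log_gaussian(x, mu1, var1) - log_gaussian(x, mu2, var2)
    return 1 if h > math.log(p2 / p1) else 2

# Assumed example: equal variances and priors, so the threshold is (0 + 4) / 2 = 2.
labels = [classify(x, mu1=0.0, mu2=4.0, var1=1.0, var2=1.0)
          for x in (-1.0, 1.9, 2.1, 5.0)]
print(labels)
```

Unequal priors shift the threshold towards the mean of the less probable class, because the right-hand side ln(P2/P1) is then non-zero.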

4.6.2 Classification Error 

Classification errors are due to the overlap of the distributions of different 
classes of signals. This is illustrated in Figure 4.16 for the binary 
classification of a scalar- valued signal and in Figure 4.17 for the binary 
classification of a two-dimensional signal. In each figure the overlapped 
area gives a measure of classification error. The obvious solution for 
reducing the classification error is to reduce the overlap of the distributions. 
This may be achieved by increasing the distance between the mean values of 
various classes or by reducing the variance of each class. In the binary 
classification of a scalar-valued variable $x$, with the classes ordered so that 
samples below the threshold are labelled $C_1$ and samples above it are labelled 
$C_2$, the probability of classification error is given by 

$$P(Error|x) = P(C_1)\, P(x > Thrsh \mid x \in C_1) + P(C_2)\, P(x < Thrsh \mid x \in C_2) \qquad (4.133)$$

For two Gaussian-distributed classes of scalar-valued signals with pdfs 
$\mathcal{N}(x(m), \mu_1, \sigma_1^2)$ and $\mathcal{N}(x(m), \mu_2, \sigma_2^2)$, Equation (4.133) becomes 

$$P(Error|x) = P(C_1) \int_{Thrsh}^{\infty} \frac{1}{\sqrt{2\pi}\, \sigma_1} \exp\left( -\frac{(x - \mu_1)^2}{2\sigma_1^2} \right) dx + P(C_2) \int_{-\infty}^{Thrsh} \frac{1}{\sqrt{2\pi}\, \sigma_2} \exp\left( -\frac{(x - \mu_2)^2}{2\sigma_2^2} \right) dx \qquad (4.134)$$
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where the parameter Thrsh is the classification threshold. 
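
The two Gaussian tail integrals in the error probability can be evaluated with the complementary error function. The sketch below assumes that class C1 has the smaller mean, so that C1 errors occur above the threshold and C2 errors below it; the function names and the numerical values are hypothetical.

```python
import math

def gaussian_tail(x, mean, sigma):
    """P(X > x) for X distributed as N(mean, sigma^2)."""
    return 0.5 * math.erfc((x - mean) / (sigma * math.sqrt(2.0)))

def error_probability(thrsh, p1, mu1, sigma1, p2, mu2, sigma2):
    """Classification error probability, assuming mu1 < thrsh < mu2."""
    miss_c1 = gaussian_tail(thrsh, mu1, sigma1)        # C1 sample above threshold
    miss_c2 = 1.0 - gaussian_tail(thrsh, mu2, sigma2)  # C2 sample below threshold
    return p1 * miss_c1 + p2 * miss_c2

# Assumed example: means 0 and 4, unit standard deviations, threshold at 2.
p_err = error_probability(thrsh=2.0, p1=0.5, mu1=0.0, sigma1=1.0,
                          p2=0.5, mu2=4.0, sigma2=1.0)
# Halving the standard deviations shrinks the overlap and hence the error.
p_err_small_var = error_probability(thrsh=2.0, p1=0.5, mu1=0.0, sigma1=0.5,
                                    p2=0.5, mu2=4.0, sigma2=0.5)
print(p_err, p_err_small_var)
```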



4.6.3 Bayesian Classification of Discrete-Valued Parameters 

Let the set $\Theta = \{\theta_i,\ i=1,\ldots,M\}$ denote the values that a discrete $P$-dimensional 
parameter vector $\theta$ can assume. In general, the observation 
space $Y$ associated with a discrete parameter space $\Theta$ may be a discrete-valued 
or a continuous-valued space. Assuming that the observation space is 
continuous, the posterior pmf of the parameter vector $\theta_i$, given observation vector $y$, 
may be expressed, using Bayes’ rule, as 

$$P_{\Theta|Y}(\theta_i|y) = \frac{f_{Y|\Theta}(y|\theta_i)\, P_{\Theta}(\theta_i)}{f_Y(y)} \qquad (4.135)$$

For the case when the observation space $Y$ is discrete-valued, the probability 
density functions are replaced by the appropriate probability mass functions. 
The Bayesian risk in selecting the parameter vector $\theta_i$, given the observation 
$y$, is defined as 

$$\mathcal{R}(\theta_i|y) = \sum_{j=1}^{M} C(\theta_i|\theta_j)\, P_{\Theta|Y}(\theta_j|y) \qquad (4.136)$$

where $C(\theta_i|\theta_j)$ is the cost of selecting the parameter $\theta_i$ when the true 
parameter is $\theta_j$. The Bayesian classification Equation (4.136) can be 
employed to obtain the maximum a posteriori, the maximum likelihood and 
the minimum mean square error classifiers. 



4.6.4 Maximum A Posteriori Classification 

MAP classification corresponds to Bayesian classification with a uniform 
cost function defined as 

$$C(\theta_i|\theta_j) = 1 - \delta(\theta_i, \theta_j) \qquad (4.137)$$

where $\delta(\cdot)$ is the delta function. Substitution of this cost function in the 
Bayesian risk function yields 

$$\mathcal{R}_{MAP}(\theta_i|y) = \sum_{j=1}^{M} [1 - \delta(\theta_i, \theta_j)]\, P_{\Theta|Y}(\theta_j|y) = 1 - P_{\Theta|Y}(\theta_i|y) \qquad (4.138)$$

Note that the MAP risk in selecting $\theta_i$ is the classification error probability; 
that is, the sum of the probabilities of all other candidates. From Equation 
(4.138), minimisation of the MAP risk function is achieved by maximisation 
of the posterior pmf: 

$$\hat{\theta}_{MAP}(y) = \arg\max_{\theta_i} P_{\Theta|Y}(\theta_i|y) = \arg\max_{\theta_i} P_{\Theta}(\theta_i)\, f_{Y|\Theta}(y|\theta_i) \qquad (4.139)$$



4.6.5 Maximum-Likelihood (ML) Classification 

The ML classification corresponds to Bayesian classification when the 
parameter $\theta$ has a uniform prior pmf and the cost function is also uniform: 

$$\mathcal{R}_{ML}(\theta_i|y) = \sum_{j=1}^{M} [1 - \delta(\theta_i, \theta_j)]\, \frac{1}{f_Y(y)}\, f_{Y|\Theta}(y|\theta_j)\, P_{\Theta}(\theta_j) = 1 - \frac{1}{f_Y(y)}\, f_{Y|\Theta}(y|\theta_i)\, P_{\Theta} \qquad (4.140)$$

where $P_{\Theta}$ is the uniform pmf of $\theta$. Minimisation of the ML risk function 
(4.140) is equivalent to maximisation of the likelihood $f_{Y|\Theta}(y|\theta_i)$: 

$$\hat{\theta}_{ML}(y) = \arg\max_{\theta_i} f_{Y|\Theta}(y|\theta_i) \qquad (4.141)$$
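
The difference between the MAP and ML classifiers can be sketched with a small example. The observation model used here (a discrete scalar parameter observed in additive Gaussian noise), the candidate values and the priors are all assumptions made for illustration.

```python
import math

def gaussian_pdf(y, mean, var):
    """Scalar Gaussian density, used here as the likelihood f(y | theta)."""
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def map_classify(y, values, priors, noise_var):
    """Select the candidate that maximises prior times likelihood."""
    scores = [p * gaussian_pdf(y, v, noise_var) for v, p in zip(values, priors)]
    return values[scores.index(max(scores))]

def ml_classify(y, values, noise_var):
    """Select the candidate that maximises the likelihood alone."""
    scores = [gaussian_pdf(y, v, noise_var) for v in values]
    return values[scores.index(max(scores))]

# Assumed setup: theta is -1 or +1, observed as y = theta + unit-variance noise.
values = [-1.0, 1.0]
priors = [0.9, 0.1]
y = 0.2
theta_ml = ml_classify(y, values, noise_var=1.0)
theta_map = map_classify(y, values, priors, noise_var=1.0)
print(theta_ml, theta_map)
```

With these values the observation is closer to +1, so the ML classifier selects +1, whereas the strong prior in favour of -1 leads the MAP classifier to select -1. With a uniform prior the two classifiers coincide.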



4.6.6 Minimum Mean Square Error Classification 

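The Bayesian minimum mean square error classification results from 
minimisation of the following risk function: 

$$\mathcal{R}_{MMSE}(\theta_i|y) = \sum_{j=1}^{M} |\theta_i - \theta_j|^2\, P_{\Theta|Y}(\theta_j|y) \qquad (4.142)$$

For the case when $P_{\Theta|Y}(\theta_j|y)$ is not available, the MMSE classifier is 
given by 

$$\hat{\theta}_{MMSE}(y) = \arg\min_{\theta_i} |\theta_i - \hat{\theta}(y)|^2 \qquad (4.143)$$

where $\hat{\theta}(y)$ is an estimate based on the observation $y$. 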



4.6.7 Bayesian Classification of Finite State Processes 

In this section, the classification problem is formulated within the 
framework of a finite state random process. A finite state process is 
composed of a probabilistic chain of a number of different random 
processes. Finite state processes are used for modelling non-stationary 
signals such as speech, image, background acoustic noise, and impulsive 
noise as discussed in Chapter 5. 

Consider a process with a set of $M$ states denoted as $S = \{s_1, s_2, \ldots, s_M\}$, 
where each state has some distinct statistical property. In its simplest form, a 
state is just a single vector, and the finite state process is equivalent to a 
discrete-valued random process with $M$ outcomes. In this case the Bayesian 
state estimation is identical to the Bayesian classification of a signal into 
one of $M$ discrete-valued vectors. More generally, a state generates 
continuous-valued, or discrete-valued vectors from a pdf, or a pmf, 
associated with the state. Figure 4.18 illustrates an $M$-state process, where 
the output of the $i$th state is expressed as 

$$x(m) = h_i(\theta_i, e(m)), \quad i=1,\ldots,M \qquad (4.144)$$



where in each state the signal $x(m)$ is modelled as the output of a state-dependent 
function $h_i(\cdot)$ with parameter $\theta_i$, input $e(m)$ and an input pdf 
$f_{E_i}(e(m))$. The prior probability of each state is given by 

$$P_S(s_i) = \frac{\mathcal{E}[N(s_i)]}{\mathcal{E}\left[ \sum_{j=1}^{M} N(s_j) \right]} \qquad (4.145)$$

where $\mathcal{E}[N(s_i)]$ is the expected number of observations from state $s_i$. The pdf 
of the output of a finite state process is a weighted combination of the pdf of 
each state and is given by 

$$f_X(x) = \sum_{i=1}^{M} P_S(s_i)\, f_{X|S}(x|s_i) \qquad (4.146)$$

In Figure 4.18, the noisy observation $y(m)$ is the sum of the process output 
$x(m)$ and an additive noise $n(m)$. From Bayes’ rule, the posterior probability 
of the state $s_i$ given the observation $y(m)$ can be expressed as 

$$P_{S|Y}(s_i|y(m)) = \frac{f_{Y|S}(y(m)|s_i)\, P_S(s_i)}{\sum_{j=1}^{M} f_{Y|S}(y(m)|s_j)\, P_S(s_j)} \qquad (4.147)$$

Figure 4.18 Illustration of a random process generated by a finite state system. 



In MAP classification, the state with the maximum posterior probability is 
selected as 

$$s_{MAP}(y(m)) = \arg\max_{s_i} P_{S|Y}(s_i|y(m)) \qquad (4.148)$$

The Bayesian state classifier assigns a misclassification cost function 
$C(s_i|s_j)$ to the action of selecting the state $s_i$ when the true state is $s_j$. The risk 
function for the Bayesian classification is given by 

$$\mathcal{R}(s_i|y(m)) = \sum_{j=1}^{M} C(s_i|s_j)\, P_{S|Y}(s_j|y(m)) \qquad (4.149)$$
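
The state posterior and the Bayesian risk can be sketched for a three-state example. The likelihood values, the priors and the uniform cost matrix below are assumed numbers chosen only to illustrate that, under a uniform cost, the minimum-risk state is the MAP state.

```python
def state_posteriors(likelihoods, priors):
    """Posterior P(s_i | y) from likelihoods f(y | s_i) and priors P(s_i)."""
    joint = [f * p for f, p in zip(likelihoods, priors)]
    total = sum(joint)
    return [j / total for j in joint]

def bayes_risk(cost, posteriors):
    """Risk of selecting each state; cost[i][j] is the cost of selecting
    state i when the true state is j."""
    return [sum(c_ij * p_j for c_ij, p_j in zip(row, posteriors)) for row in cost]

# Assumed values of f(y(m) | s_i) for one observation, and assumed state priors.
likelihoods = [0.05, 0.20, 0.10]
priors = [0.5, 0.3, 0.2]
post = state_posteriors(likelihoods, priors)

# Uniform cost: zero for a correct decision, one for any error.
uniform_cost = [[0 if i == j else 1 for j in range(3)] for i in range(3)]
risks = bayes_risk(uniform_cost, post)

map_state = post.index(max(post))
min_risk_state = risks.index(min(risks))
print(post, risks, map_state, min_risk_state)
```

A non-uniform cost matrix can move the minimum-risk decision away from the MAP state, for example when one type of misclassification is much more expensive than the others.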



4.6.8 Bayesian Estimation of the Most Likely State Sequence 

Consider the estimation of the most likely state sequence 
$s = [s_{i_0}, s_{i_1}, \ldots, s_{i_{T-1}}]$ of a finite state process, given a sequence of $T$ 
observation vectors $Y = [y_0, y_1, \ldots, y_{T-1}]$. A state sequence $s$, of length $T$, is 
itself a random integer-valued vector process with $N^T$ possible values, where 
$N$ is the number of states. From Bayes’ rule, the posterior pmf of a state 
sequence $s$, given an observation sequence $Y$, can be expressed as 

$$P_{S|Y}(s_{i_0}, \ldots, s_{i_{T-1}} \mid y_0, \ldots, y_{T-1}) = \frac{f_{Y|S}(y_0, \ldots, y_{T-1} \mid s_{i_0}, \ldots, s_{i_{T-1}})\, P_S(s_{i_0}, \ldots, s_{i_{T-1}})}{f_Y(y_0, \ldots, y_{T-1})} \qquad (4.150)$$

where $P_S(s)$ is the pmf of the state sequence $s$, and for a given observation 
sequence, the denominator $f_Y(y_0, \ldots, y_{T-1})$ is a constant. The Bayesian risk 
in selecting a state sequence $s_i$ is expressed as 

$$\mathcal{R}(s_i|Y) = \sum_{j=1}^{N^T} C(s_i|s_j)\, P_{S|Y}(s_j|Y) \qquad (4.151)$$

Figure 4.19 A three state Markov Process. 

For a statistically independent process, the state of the process at any time is 
independent of the previous states, and hence the conditional probability of 
a state sequence can be written as 

$$P_{S|Y}(s_{i_0}, \ldots, s_{i_{T-1}} \mid y_0, \ldots, y_{T-1}) = \prod_{k=0}^{T-1} P_{S|Y}(s_{i_k}|y_k) \propto \prod_{k=0}^{T-1} f_{Y|S}(y_k|s_{i_k})\, P_S(s_{i_k}) \qquad (4.152)$$

where $s_{i_k}$ denotes state $s_i$ at time instant $k$. A particular case of a finite state 
process is the Markov chain where the state transition is governed by a 
Markovian process such that the probability of the state $i$ at time $m$ depends 
on the state of the process at time $m-1$. The conditional pmf of a Markov 
state sequence can be expressed as 

$$P_{S|Y}(s_{i_0}, \ldots, s_{i_{T-1}} \mid y_0, \ldots, y_{T-1}) \propto \prod_{k=0}^{T-1} a_{i_{k-1} i_k}\, f_{Y|S}(y_k|s_{i_k}) \qquad (4.153)$$

where $a_{i_{k-1} i_k}$ is the probability that the process moves from state $s_{i_{k-1}}$ to 
state $s_{i_k}$ (for $k=0$ this factor is the initial state probability). Finite state 
random processes and computationally efficient methods of state sequence 
estimation are described in detail in Chapter 5. 
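
For a very short sequence the most likely state sequence can be found by exhaustively scoring all N^T candidates, which makes the growth of the search space explicit. The sketch below is only feasible for small T and is not one of the efficient methods referred to above; the two-state Markov chain, its transition probabilities, the Gaussian state pdfs and the observations are all assumed for illustration.

```python
import itertools
import math

def gaussian_pdf(y, mean, var):
    """Scalar Gaussian density, used as the state-conditional pdf f(y | s)."""
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def sequence_score(states, obs, init, trans, means, var):
    """Unnormalised posterior of a state sequence: initial probability and
    transition probabilities multiplied by the observation likelihoods."""
    score = init[states[0]] * gaussian_pdf(obs[0], means[states[0]], var)
    for k in range(1, len(obs)):
        score *= trans[states[k - 1]][states[k]] * gaussian_pdf(obs[k], means[states[k]], var)
    return score

def most_likely_sequence(obs, init, trans, means, var):
    """Exhaustive search over all N**T state sequences."""
    n_states = len(init)
    candidates = itertools.product(range(n_states), repeat=len(obs))
    return max(candidates, key=lambda s: sequence_score(s, obs, init, trans, means, var))

# Assumed two-state chain that tends to stay in its current state.
init = [0.5, 0.5]
trans = [[0.9, 0.1], [0.1, 0.9]]
means = [0.0, 3.0]
obs = [0.1, -0.2, 1.6, 2.9, 3.2]
best = most_likely_sequence(obs, init, trans, means, var=1.0)
print(best)
```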
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4.7 Modelling the Space of a Random Process 

In this section, we consider the training of statistical models for a database 
of $P$-dimensional vectors of a random process. The vectors in the database 
can be visualised as forming a number of clusters or regions in a $P$-dimensional 
space. The statistical modelling method consists of two steps: 
(a) the partitioning of the database into a number of regions, or clusters, and 
(b) the estimation of the parameters of a statistical model for each cluster. A 
simple method for modelling the space of a random signal is to use a set of 
prototype vectors that represent the centroids of the signal space. This 
method effectively quantises the space of a random process into a relatively 
small number of typical vectors, and is known as vector quantisation (VQ). 
In the following, we first consider a VQ model of a random process, and 
then extend this model to a pdf model, based on a mixture of Gaussian 
densities. 



4.7.1 Vector Quantisation of a Random Process 

In vector quantisation, the space of a random vector process $X$ is partitioned 
into $K$ clusters or regions $[X_1, X_2, \ldots, X_K]$, and each cluster $X_i$ is represented 
by a cluster centroid $c_i$. The set of centroid vectors $[c_1, c_2, \ldots, c_K]$ form a VQ 
codebook model of the process $X$. The VQ codebook can then be used to 
classify an unlabelled vector $x$ with the nearest centroid. The codebook is 
searched to find the centroid vector with the minimum distance from $x$, then 
$x$ is labelled with the index of the minimum distance centroid as 

$$Label(x) = \arg\min_{i}\, d(x, c_i) \qquad (4.154)$$

where $d(x, c_i)$ is a measure of distance between the vectors $x$ and $c_i$. The 
most commonly used distance measure is the mean squared distance. 



4.7.2 Design of a Vector Quantiser: K-Means Clustering 

The $K$-means algorithm, illustrated in Figure 4.20, is an iterative method for 
the design of a VQ codebook. Each iteration consists of two basic steps: (a) 
partition the training signal space into $K$ regions or clusters and (b) compute 
the centroid of each region. The steps in the $K$-means method are as follows: 

Figure 4.20 Illustration of the K-means clustering method: select initial centroids and 
form cluster partitions, then update cluster centroids. 

Step 1: Initialisation Use a suitable method to choose a set of $K$ initial 
centroids $[c_i]$. For $m = 1, 2, \ldots$ 

Step 2: Classification Classify the training vectors $\{x\}$ into $K$ clusters $\{[x_1], 
[x_2], \ldots, [x_K]\}$ using the so-called nearest-neighbour rule Equation 
(4.154). 

Step 3: Centroid computation Use the vectors $[x_i]$ associated with the $i$th 
cluster to compute an updated cluster centroid $c_i$, and calculate the 
cluster distortion defined as 

$$D_i(m) = \frac{1}{N_i} \sum_{j=1}^{N_i} d(x_i(j), c_i(m)) \qquad (4.155)$$

where it is assumed that a set of $N_i$ vectors $[x_i(j),\ j=1,\ldots,N_i]$ are 
associated with cluster $i$. The total distortion is given by 

$$D(m) = \sum_{i=1}^{K} D_i(m) \qquad (4.156)$$

Step 4: Convergence test: 
if 
$D(m-1) - D(m) < Threshold$ stop, 
else 
goto Step 2. 

A vector quantiser models the regions, or the clusters, of the signal space 
with a set of cluster centroids. A more complete description of the signal 
space can be achieved by modelling each cluster with a Gaussian density as 
described in the next chapter. 
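
The K-means steps can be sketched as follows. This is a minimal illustration under stated assumptions: the squared Euclidean distance is used as the distance measure, the training data are synthetic two-dimensional vectors, the initial centroids are simply two of the training vectors, and an empty cluster keeps its previous centroid (a choice made here for simplicity).

```python
import random

def squared_distance(x, c):
    """Squared Euclidean distance between vectors x and c."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))

def nearest(x, centroids):
    """Nearest-neighbour rule: index of the minimum-distance centroid."""
    return min(range(len(centroids)), key=lambda i: squared_distance(x, centroids[i]))

def k_means(vectors, centroids, threshold=1e-6, max_iter=100):
    """Iterate classification and centroid update until the total
    distortion improves by less than the threshold."""
    previous = float("inf")
    for _ in range(max_iter):
        # Step 2: classify each training vector with its nearest centroid.
        clusters = [[] for _ in centroids]
        for x in vectors:
            clusters[nearest(x, centroids)].append(x)
        # Step 3: update centroids and accumulate the cluster distortions.
        total = 0.0
        for i, members in enumerate(clusters):
            if not members:
                continue  # an empty cluster keeps its old centroid
            dim = len(members[0])
            centroids[i] = [sum(x[d] for x in members) / len(members) for d in range(dim)]
            total += sum(squared_distance(x, centroids[i]) for x in members) / len(members)
        # Step 4: stop when the distortion no longer decreases appreciably.
        if previous - total < threshold:
            break
        previous = total
    return centroids, total

# Synthetic training set: two assumed clusters around (0, 0) and (5, 5).
rng = random.Random(0)
vectors = ([[rng.gauss(0.0, 0.5), rng.gauss(0.0, 0.5)] for _ in range(100)]
           + [[rng.gauss(5.0, 0.5), rng.gauss(5.0, 0.5)] for _ in range(100)])
centroids, distortion = k_means(vectors, centroids=[list(vectors[0]), list(vectors[-1])])
print(centroids, distortion)
```

The result of K-means depends on the initial centroids, so in practice several initialisations are often tried and the codebook with the lowest distortion is kept.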



4.8 Summary 

This chapter began with an introduction to the basic concepts in estimation 
theory, such as the signal space and the parameter space, the prior and 
posterior spaces, and the statistical measures that are used to quantify the 
performance of an estimator. The Bayesian inference method, with its 
ability to include as much information as is available, provides a general 
framework for statistical signal processing problems. The minimum mean 
square error, the maximum-likelihood, the maximum a posteriori, and the 
minimum absolute value of error methods were derived from the Bayesian 
formulation. Further examples of the applications of Bayesian type models 
in this book include the hidden Markov models for non-stationary processes 
studied in Chapter 5, and blind equalisation of distorted signals studied in 
Chapter 15. 

We considered a number of examples of the estimation of a signal 
observed in noise, and derived the expressions for the effects of using prior 
pdfs on the mean and the variance of the estimates. The choice of the prior 
pdf is an important consideration in Bayesian estimation. Many processes, 
for example speech or the response of a telecommunication channel, are not 
uniformly distributed in space, but are constrained to a particular region of 
signal or parameter space. The use of a prior pdf can guide the estimator to 
focus on the posterior space that is the subspace consistent with both the 
likelihood and the prior pdfs. The choice of the prior, depending on how 
well it fits the process, can have a significant influence on the solutions. 

The iterative estimate-maximise method, studied in Section 4.3, 
provides a practical framework for solving many statistical signal 
processing problems, such as the modelling of a signal space with a mixture 
of Gaussian densities, and the training of hidden Markov models in Chapter 5. 
In Section 4.4 the Cramer-Rao lower bound on the variance of an estimator 
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was derived, and it was shown that the use of a prior pdf can reduce the 
minimum estimator variance. 

Finally we considered the modelling of a data space with a mixture 
Gaussian process, and used the EM method to derive a solution for the 
parameters of the mixture Gaussian model. 
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HIDDEN MARKOV MODELS 



5.1 Statistical Models for Non-Stationary Processes 

5.2 Hidden Markov Models 

5.3 Training Hidden Markov Models 

5.4 Decoding of Signals Using Hidden Markov Models 

5.5 HMM-Based Estimation of Signals in Noise 

5.6 Signal and Noise Model Combination and Decomposition 

5.7 HMM-Based Wiener Filters 

5.8 Summary 



Hidden Markov models (HMMs) are used for the statistical modelling 
of non-stationary signal processes such as speech signals, image 
sequences and time-varying noise. An HMM models the time 
variations (and/or the space variations) of the statistics of a random process 
with a Markovian chain of state-dependent stationary subprocesses. An 
HMM is essentially a Bayesian finite state process, with a Markovian prior 
for modelling the transitions between the states, and a set of state probability 
density functions for modelling the random variations of the signal process 
within each state. This chapter begins with a brief introduction to 
continuous and finite state non-stationary models, before concentrating on 
the theory and applications of hidden Markov models. We study the various 
HMM structures, the Baum-Welch method for the maximum-likelihood 
training of the parameters of an HMM, and the use of HMMs and the 
Viterbi decoding algorithm for the classification and decoding of an 
unlabelled observation signal sequence. Finally, applications of the HMMs 
for the enhancement of noisy signals are considered. 
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Figure 5.1 Illustration of a two-layered model of a non-stationary process. 

5.1 Statistical Models for Non-Stationary Processes 

A non-stationary process can be defined as one whose statistical parameters 
vary over time. Most “naturally generated” signals, such as audio signals, 
image signals, biomedical signals and seismic signals, are non-stationary, in 
that the parameters of the systems that generate the signals, and the 
environments in which the signals propagate, change with time. 

A non-stationary process can be modelled as a double-layered 
stochastic process, with a hidden process that controls the time variations of 
the statistics of an observable process, as illustrated in Figure 5.1. In 
general, non-stationary processes can be classified into one of two broad 
categories: 

(a) Continuously variable state processes. 

(b) Finite state processes. 

A continuously variable state process is defined as one whose underlying 
statistics vary continuously with time. Examples of this class of random 
processes are audio signals such as speech and music, whose power and 
spectral composition vary continuously with time. A finite state process is 
one whose statistical characteristics can switch between a finite number of 
stationary or non-stationary states. For example, impulsive noise is a binary- 
state process. Continuously variable processes can be approximated by an 
appropriate finite state process. 

Figure 5.2 (a) A continuously variable state AR process, (b) A binary-state AR 

Figure 5.2(a) illustrates a non-stationary first-order autoregressive (AR) 
process. This process is modelled as the combination of a hidden stationary 
AR model of the signal parameters, and an observable time-varying AR 
model of the signal. The hidden model controls the time variations of the 
parameters of the non-stationary AR model. For this model, the observation 
signal equation and the parameter state equation can be expressed as 

$$x(m) = a(m)\, x(m-1) + e(m) \qquad \text{Observation equation} \qquad (5.1)$$

$$a(m) = \beta\, a(m-1) + \varepsilon(m) \qquad \text{Hidden state equation} \qquad (5.2)$$

where a(m) is the time- varying coefficient of the observable AR process and 
/ 3 is the coefficient of the hidden state-control process. 
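
As an illustration, the two-layer model of Equations (5.1) and (5.2) can be 
simulated in a few lines. The following Python sketch is not part of the 
original text: the function name, the choice of zero-mean Gaussian 
excitations for both layers, and all default parameter values are 
assumptions made for the example. 

```python
import random


def simulate_nonstationary_ar(num_samples, beta=0.99, state_noise_std=0.01,
                              obs_noise_std=1.0, a0=0.5, seed=0):
    """Simulate a first-order AR process with a time-varying coefficient.

    Returns (xs, coeffs): the observable signal x(m) and the hidden
    coefficient track a(m). Gaussian excitations are an assumption.
    """
    rng = random.Random(seed)
    a_prev, x_prev = a0, 0.0
    xs, coeffs = [], []
    for _ in range(num_samples):
        # Hidden state equation (5.2): a(m) = beta * a(m-1) + eps(m)
        a = beta * a_prev + rng.gauss(0.0, state_noise_std)
        # Observation equation (5.1): x(m) = a(m) * x(m-1) + e(m)
        x = a * x_prev + rng.gauss(0.0, obs_noise_std)
        coeffs.append(a)
        xs.append(x)
        a_prev, x_prev = a, x
    return xs, coeffs
```

With both noise levels set to zero the hidden coefficient simply decays 
geometrically as a(m) = beta^m a(0), which gives a quick sanity check of the 
hidden layer on its own. 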

A simple example of a finite state non-stationary model is the 
binary-state autoregressive process illustrated in Figure 5.2(b), where at each time 
instant a random switch selects one of the two AR models for connection to 
the output terminal. For this model, the output signal x(m) can be expressed 
as 

$$x(m) = \bar{s}(m)\,x_0(m) + s(m)\,x_1(m) \qquad (5.3)$$

where $x_0(m)$ and $x_1(m)$ are the outputs of the two AR models, the binary 
switch s(m) selects the state of the process at time m, and 
$\bar{s}(m)$ denotes the Boolean complement of s(m). 

Figure 5.3 (a) Illustration of a two-layered random process, (b) An HMM model of 
the process in (a). 



5.2 Hidden Markov Models 

A hidden Markov model (HMM) is a double-layered finite state process, 
with a hidden Markovian process that controls the selection of the states of 
an observable process. As a simple illustration of a binary-state Markovian 
process, consider Figure 5.3, which shows two containers of different 
mixtures of black and white balls. The probabilities of the black and the white 
balls in each container, denoted as $P_B$ and $P_W$ respectively, are as shown 
in Figure 5.3. Assume that at successive time intervals a hidden 
selection process selects one of the two containers to release a ball. The 
balls released are replaced so that the mixture density of the black and the 
white balls in each container remains unaffected. Each container can be 
considered as an underlying state of the output process. Now, as an example, 
assume that the hidden container-selection process is governed by the 
following rule: at any time, if the output from the currently selected 
container is a white ball then the same container is selected to output the 
next ball, otherwise the other container is selected. This is an example of a 
Markovian process because the next state of the process depends on the 
current state as shown in the binary state model of Figure 5.3(b). Note that 
in this example the observable outcome does not unambiguously indicate 
the underlying hidden state, because both states are capable of releasing 
black and white balls. 
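
The container-selection rule described above can be simulated directly. 
The sketch below is an illustration only: the function names are 
hypothetical, and the white-ball probabilities passed in are placeholder 
values, since the probabilities printed in Figure 5.3 are not reproduced 
here. 

```python
import random


def simulate_containers(p_white, num_draws, seed=0):
    """Simulate the two-container process.

    p_white[c] is the (assumed) probability of drawing a white ball from
    container c, for c in {0, 1}. Returns the hidden container sequence
    and the observed ball colours.
    """
    rng = random.Random(seed)
    state = 0  # assumed starting container
    states, balls = [], []
    for _ in range(num_draws):
        ball = "white" if rng.random() < p_white[state] else "black"
        states.append(state)
        balls.append(ball)
        # Rule from the text: stay after a white ball, switch after a black one.
        if ball == "black":
            state = 1 - state
    return states, balls


def implied_transition_matrix(p_white):
    """State transition matrix implied by the rule: P(stay in c) = p_white[c]."""
    return [[p_white[0], 1.0 - p_white[0]],
            [1.0 - p_white[1], p_white[1]]]
```

Under this rule the self-loop probability of each container equals its 
white-ball probability, so the hidden layer is a two-state Markov chain, 
while an observer who sees only the ball colours cannot read the container 
off a single observation. 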

In general, a hidden Markov model has N states, with each state trained 
to model a distinct segment of a signal process. A hidden Markov model can 
be used to model a time-varying random process as a probabilistic 
Markovian chain of N stationary, or quasi-stationary, elementary 
sub-processes. A general form of a three-state HMM is shown in Figure 5.4. 
This structure is known as an ergodic HMM. In the context of an HMM, the 
term “ergodic” implies that there are no structural constraints for connecting 
any state to any other state. 

A more constrained form of an HMM is the left-right model of Figure 
5.5, so-called because the allowed state transitions are those from a left state 
to a right state and the self-loop transitions. The left-right constraint is 
useful for the characterisation of temporal or sequential structures of 
stochastic signals such as speech and musical signals, because time may be 
visualised as having a direction from left to right. 




Figure 5.4 A three-state ergodic HMM structure. 

Figure 5.5 A 5-state left-right HMM speech model. 



5.2.1 A Physical Interpretation of Hidden Markov Models 

For a physical interpretation of the use of HMMs in modelling a signal 
process, consider the illustration of Figure 5.5 which shows a left-right 
HMM of a spoken letter “C”, phonetically transcribed as ‘s-iy’, together 
with a plot of the speech signal waveform for “C”. In general, there are two 
main types of variation in speech and other stochastic signals: variations in 
the spectral composition, and variations in the time-scale or the articulation 
rate. In a hidden Markov model, these variations are modelled by the state 
observation and the state transition probabilities. A useful way of 
interpreting and using HMMs is to consider each state of an HMM as a 
model of a segment of a stochastic process. For example, in Figure 5.5, state 
$S_1$ models the first segment of the spoken letter “C”, state $S_2$ models the 
second segment, and so on. Each state must have a mechanism to 
accommodate the random variations in different realisations of the segments 
that it models. The state transition probabilities provide a mechanism for 
the connection of various states, and for modelling the variations in the 
duration and time-scales of the signals in each state. For example, if a 
segment of a speech utterance is elongated, owing, say, to slow articulation, 
then this can be accommodated by more self-loop transitions into the state 
that models the segment. Conversely, if a segment of a word is omitted, 
owing, say, to fast speaking, then the skip-next-state connection 
accommodates that situation. The state observation pdfs model the 
probability distributions of the spectral composition of the signal segments 
associated with each state. 



5.2.2 Hidden Markov Model as a Bayesian Model 

A hidden Markov model $\mathcal{M}$ is a Bayesian structure with a Markovian state 
transition probability and a state observation likelihood that can be either a 
discrete pmf or a continuous pdf. The posterior pmf of a state sequence s of 
a model $\mathcal{M}$, given an observation sequence X, can be expressed using Bayes’ 
rule as the product of a state prior pmf and an observation likelihood 
function: 

$$P_{S|X,\mathcal{M}}(s|X,\mathcal{M}) = \frac{1}{f_X(X)}\,P_{S|\mathcal{M}}(s|\mathcal{M})\,f_{X|S,\mathcal{M}}(X|s,\mathcal{M}) \qquad (5.4)$$

where the observation sequence X is modelled by the probability density 
function $f_{X|S,\mathcal{M}}(X|s,\mathcal{M})$. 

The posterior probability that an observation signal sequence X was 
generated by the model $\mathcal{M}$ is summed over all likely state sequences, and 
may also be weighted by the model prior $P_{\mathcal{M}}(\mathcal{M})$: 

$$P_{\mathcal{M}|X}(\mathcal{M}|X) = \frac{1}{f_X(X)}\,\underbrace{P_{\mathcal{M}}(\mathcal{M})}_{\text{Model prior}}\,\sum_{s}\,\underbrace{P_{S|\mathcal{M}}(s|\mathcal{M})}_{\text{State prior}}\,\underbrace{f_{X|S,\mathcal{M}}(X|s,\mathcal{M})}_{\text{Observation likelihood}} \qquad (5.5)$$



The Markovian state transition prior can be used to model the time 
variations and the sequential dependence of most non-stationary processes. 
However, for many applications, such as speech recognition, the state 
observation likelihood has far more influence on the posterior probability 
than the state transition prior. 

5.2.3 Parameters of a Hidden Markov Model 

A hidden Markov model has the following parameters: 

Number of states N. This is usually set to the total number of distinct, or 
elementary, stochastic events in a signal process. For example, in 
modelling a binary-state process such as impulsive noise, N is set to 2, 
and in isolated-word speech modelling N is typically set between 5 and 10. 

State transition-probability matrix $A = \{a_{ij};\ i,j = 1, \ldots, N\}$. This provides a 
Markovian connection network between the states, and models the 
variations in the duration of the signals associated with each state. For 
a left-right HMM (see Figure 5.5), $a_{ij} = 0$ for $i > j$, and hence the 
transition matrix A is upper-triangular. 

State observation vectors $\{\mu_{i1}, \mu_{i2}, \ldots, \mu_{iM};\ i = 1, \ldots, N\}$. For each state a set 
of M prototype vectors model the centroids of the signal space 
associated with each state. 

State observation vector probability model. This can be either a discrete 
model composed of the M prototype vectors and their associated 
probability mass function (pmf) $P = \{P_{ij}(\cdot);\ i = 1, \ldots, N,\ j = 1, \ldots, M\}$, or it 
may be a continuous (usually Gaussian) pdf model 
$F = \{f_{ij}(\cdot);\ i = 1, \ldots, N,\ j = 1, \ldots, M\}$. 

Initial state probability vector $\pi = [\pi_1, \pi_2, \ldots, \pi_N]$. 
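
To make the structural constraint on A concrete, the following sketch builds 
the transition matrix of a simple left-right HMM and checks the two 
properties stated above: each row is a probability distribution, and all 
entries below the diagonal are zero. The function names, the single shared 
self-loop probability and the absence of skip transitions are assumptions 
made for the illustration. 

```python
def left_right_transition_matrix(num_states, self_loop=0.6):
    """Build an upper-triangular A with self-loop and next-state transitions.

    Every non-final state stays with probability `self_loop` and moves to
    the next state otherwise; the final state is absorbing (an assumption).
    """
    A = [[0.0] * num_states for _ in range(num_states)]
    for i in range(num_states):
        if i == num_states - 1:
            A[i][i] = 1.0
        else:
            A[i][i] = self_loop
            A[i][i + 1] = 1.0 - self_loop
    return A


def is_valid_left_right(A, tol=1e-12):
    """Check that rows sum to one and that a_ij = 0 for i > j."""
    rows_ok = all(abs(sum(row) - 1.0) < tol for row in A)
    upper_ok = all(A[i][j] == 0.0 for i in range(len(A)) for j in range(i))
    return rows_ok and upper_ok


# A matching initial state vector for a left-right model starts in state 1.
initial_probs = [1.0, 0.0, 0.0, 0.0, 0.0]
transition = left_right_transition_matrix(5)
```

An ergodic model would instead allow non-zero entries anywhere in A, so the 
same row-sum check applies but the triangularity check does not. 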

5.2.4 State Observation Models 

Depending on whether a signal process is discrete-valued or continuous- 
valued, the state observation model for the process can be either a discrete- 
valued probability mass function (pmf), or a continuous-valued probability 
density function (pdf). The discrete models can also be used for the 
modelling of the space of a continuous-valued process quantised into a 
number of discrete points. First, consider a discrete state observation density 
model. Assume that associated with the i-th state of an HMM there are M 
discrete centroid vectors $[\mu_{i1}, \ldots, \mu_{iM}]$ with a pmf $[P_{i1}, \ldots, P_{iM}]$. These 
centroid vectors and their probabilities are normally obtained through 
clustering of a set of training signals associated with each state. 

Figure 5.6 Modelling a random signal space using (a) a discrete-valued pmf 
and (b) a continuous-valued mixture Gaussian density. 




For the modelling of a continuous-valued process, the signal space 
associated with each state is partitioned into a number of clusters as in 
Figure 5.6. If the signals within each cluster are modelled by a uniform 
distribution then each cluster is described by the centroid vector and the 
cluster probability, and the state observation model consists of M cluster 
centroids and the associated pmf $\{\mu_{ik}, P_{ik};\ i = 1, \ldots, N,\ k = 1, \ldots, M\}$. In effect, 
this results in a discrete state observation HMM for a continuous-valued 
process. Figure 5.6(a) shows a partitioning, and quantisation, of a signal 
space into a number of centroids. 

Now if each cluster of the state observation space is modelled by a 
continuous pdf, such as a Gaussian pdf, then a continuous density HMM 
results. The most widely used state observation pdf for an HMM is the 
mixture Gaussian density defined as 




$$f_{X|S}(x \,|\, s=i) = \sum_{k=1}^{M} P_{ik}\,\mathcal{N}(x;\,\mu_{ik}, \Sigma_{ik}) \qquad (5.6)$$

where $\mathcal{N}(x;\,\mu_{ik}, \Sigma_{ik})$ is a Gaussian density with mean vector $\mu_{ik}$ and 
covariance matrix $\Sigma_{ik}$, and $P_{ik}$ is a mixture weighting factor for the k-th 
Gaussian pdf of the state i. Note that $P_{ik}$ is the prior probability of the k-th 
mode of the mixture pdf for the state i. Figure 5.6(b) shows the space of a 
mixture Gaussian model of an observation signal space. A 5-mode mixture 
Gaussian pdf is shown in Figure 5.7. 
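
The mixture density of Equation (5.6) is straightforward to evaluate. The 
sketch below restricts the observation to a scalar so that each covariance 
matrix reduces to a variance; this simplification, the function names and 
the example parameter values are assumptions made for the illustration. 

```python
import math


def gaussian_pdf(x, mean, var):
    """Scalar Gaussian density with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def mixture_pdf(x, weights, means, variances):
    """State observation pdf of Equation (5.6) for one state (scalar case).

    weights[k] plays the role of P_ik and must sum to one over k.
    """
    return sum(w * gaussian_pdf(x, m, v)
               for w, m, v in zip(weights, means, variances))


# Example: a two-mode mixture for a single state.
density_at_zero = mixture_pdf(0.0, [0.3, 0.7], [-2.0, 3.0], [1.0, 0.5])
```

Because the weights sum to one and each component integrates to one, the 
mixture is itself a valid pdf; a numerical integration over a wide interval 
is a simple way to confirm this for a given parameter set. 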

5.2.5 State Transition Probabilities 

The first-order Markovian property of an HMM entails that the transition 
probability to any state s(t) at time t depends only on the state of the process 
at time t-1, s(t-1), and is independent of the previous states of the HMM. 
This can be expressed as 

$$\mathrm{Prob}\big(s(t)=j \,\big|\, s(t-1)=i,\ s(t-2)=k,\ \ldots,\ s(t-N)=l\big) = \mathrm{Prob}\big(s(t)=j \,\big|\, s(t-1)=i\big) = a_{ij} \qquad (5.7)$$

where s(t) denotes the state of the HMM at time t. The transition probabilities 
provide a probabilistic mechanism for connecting the states of an HMM, 
and for modelling the variations in the duration of the signals associated 
with each state. The probability of occupancy of a state i for d consecutive 
time units, $P_i(d)$, can be expressed in terms of the state self-loop transition 
probabilities as 

$$P_i(d) = a_{ii}^{\,d-1}\,(1 - a_{ii}) \qquad (5.8)$$

From Equation (5.8), using the geometric series conversion formula, the 
mean occupancy duration for each state of an HMM can be derived as 

$$\text{Mean occupancy of state } i = \sum_{d=1}^{\infty} d\,P_i(d) = \frac{1}{1 - a_{ii}} \qquad (5.9)$$
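
The closed form of Equation (5.9) can be checked numerically against a 
truncated version of the sum. The short sketch below does this for the 
geometric occupancy pmf of Equation (5.8); the function names and the 
example value of the self-loop probability are assumptions. 

```python
def occupancy_pmf(a_ii, d):
    """Probability of staying in a state for exactly d time units, Eq. (5.8)."""
    return a_ii ** (d - 1) * (1.0 - a_ii)


def mean_occupancy(a_ii):
    """Closed-form mean state duration, Eq. (5.9)."""
    return 1.0 / (1.0 - a_ii)


# Truncated sum of d * P_i(d); the neglected tail is negligible for a_ii = 0.9.
a_ii = 0.9
numeric_mean = sum(d * occupancy_pmf(a_ii, d) for d in range(1, 3000))
```

For a self-loop probability of 0.9 both routes give a mean duration of 
about 10 time units. Note that the geometric pmf always has its mode at 
d = 1, which is one known limitation of modelling state duration through 
self-loop transitions alone. 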

Figure 5.8 (a) A 4-state left-right HMM, and (b) its state-time trellis diagram. 



5.2.6 State-Time Trellis Diagram 

A state-time trellis diagram shows the HMM states together with all the 
different paths that can be taken through various states as time unfolds. 
Figure 5.8(a) and 5.8(b) illustrate a 4-state HMM and its state-time 
diagram. Since the number of states and the state parameters of an HMM are 
time-invariant, a state-time diagram is a repetitive and regular trellis 
structure. Note that in Figure 5.8 for a left-right HMM the state-time trellis 
has to diverge from the first state and converge into the last state. In general, 
there are many different state sequences that start from the initial state and 
end in the final state. Each state sequence has a prior probability that can be 
obtained by multiplication of the state transition probabilities of the 
sequence. For example, the probability of the state sequence 
$s = [S_1, S_1, S_2, S_2, S_3, S_3, S_4]$ is $P(s) = \pi_1 a_{11} a_{12} a_{22} a_{23} a_{33} a_{34}$. Since each state has 
a different set of prototype observation vectors, different state sequences 
model different observation sequences. In general, over T time instants an 
N-state HMM can reproduce $N^T$ different realisations of the random process 
that it is trained to model. 
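
The prior probability of a state sequence is simply a product along one path 
of the trellis. The sketch below computes it for a 4-state left-right 
model and also sums the prior over every possible sequence of a given 
length. The function names, the zero-based state indexing and the 
transition values are assumptions made for the illustration. 

```python
import itertools


def state_sequence_prior(pi, A, seq):
    """Prior probability of a state sequence: pi_{s(0)} times the product
    of the transition probabilities along the sequence."""
    prob = pi[seq[0]]
    for prev, nxt in zip(seq, seq[1:]):
        prob *= A[prev][nxt]
    return prob


def total_prior_mass(pi, A, length):
    """Sum the prior over all N**length state sequences (brute force)."""
    states = range(len(pi))
    return sum(state_sequence_prior(pi, A, seq)
               for seq in itertools.product(states, repeat=length))


# Assumed 4-state left-right model with self-loop probability 0.6.
pi = [1.0, 0.0, 0.0, 0.0]
A = [[0.6, 0.4, 0.0, 0.0],
     [0.0, 0.6, 0.4, 0.0],
     [0.0, 0.0, 0.6, 0.4],
     [0.0, 0.0, 0.0, 1.0]]
# States 1,1,2,2,3,3,4 of the text, written with zero-based indices.
example_prior = state_sequence_prior(pi, A, [0, 0, 1, 1, 2, 2, 3])
```

Sequences that violate the left-right constraint receive zero prior, and 
the priors of all sequences of a fixed length sum to one, as they must for 
a valid Markov chain. 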

5.3 Training Hidden Markov Models 

The first step in training the parameters of an HMM is to collect a training 
database of a sufficiently large number of different examples of the random 
process to be modelled. Assume that the examples in a training database 
consist of L vector-valued sequences $[X] = [X_k;\ k = 0, \ldots, L-1]$, with each 
sequence $X_k = [x(t);\ t = 0, \ldots, T_k - 1]$ having a variable number of $T_k$ vectors. 
The objective is to train the parameters of an HMM to model the statistics of 
the signals in the training data set. In a probabilistic sense, the fitness of a 
model is measured by the posterior probability $P_{\mathcal{M}|X}(\mathcal{M}|X)$ of the model $\mathcal{M}$ 
given the training data X. The training process aims to maximise the 
posterior probability of the model $\mathcal{M}$ given the training data [X], expressed 
using Bayes’ rule as 

$$P_{\mathcal{M}|X}(\mathcal{M}|X) = \frac{1}{f_X(X)}\,f_{X|\mathcal{M}}(X|\mathcal{M})\,P_{\mathcal{M}}(\mathcal{M}) \qquad (5.10)$$

where the denominator $f_X(X)$ on the right-hand side of Equation (5.10) has 
only a normalising effect and $P_{\mathcal{M}}(\mathcal{M})$ is the prior probability of the model $\mathcal{M}$. 
For a given training data set [X] and a given model $\mathcal{M}$, maximising Equation 
(5.10) is equivalent to maximising the likelihood function $f_{X|\mathcal{M}}(X|\mathcal{M})$. The 
likelihood of an observation vector sequence X given a model $\mathcal{M}$ can be 
expressed as 

$$f_{X|\mathcal{M}}(X|\mathcal{M}) = \sum_{s} f_{X|S,\mathcal{M}}(X|s,\mathcal{M})\,P_{S|\mathcal{M}}(s|\mathcal{M}) \qquad (5.11)$$

where $f_{X|S,\mathcal{M}}(X|s,\mathcal{M})$, the pdf of the signal sequence X along the state 
sequence s=[s(0), s(1), ..., s(T-1)] of the model $\mathcal{M}$, is given by 

$$f_{X|S,\mathcal{M}}(X|s,\mathcal{M}) = f_{X|S}\big(x(0)|s(0)\big)\,f_{X|S}\big(x(1)|s(1)\big) \cdots f_{X|S}\big(x(T-1)|s(T-1)\big) \qquad (5.12)$$

where s(t), the state at time t, can be one of N states, and $f_{X|S}(x(t)|s(t))$, a 
shorthand for $f_{X|S,\mathcal{M}}(x(t)|s(t),\mathcal{M})$, is the pdf of x(t) given the state s(t) of the 
model $\mathcal{M}$. The Markovian probability of the state sequence s is given by 

$$P_{S|\mathcal{M}}(s|\mathcal{M}) = \pi_{s(0)}\,a_{s(0)s(1)}\,a_{s(1)s(2)} \cdots a_{s(T-2)s(T-1)} \qquad (5.13)$$

Substituting Equations (5.12) and (5.13) in Equation (5.11) yields 

$$\begin{aligned}
f_{X|\mathcal{M}}(X|\mathcal{M}) &= \sum_{s} f_{X|S,\mathcal{M}}(X|s,\mathcal{M})\,P_{S|\mathcal{M}}(s|\mathcal{M}) \\
&= \sum_{s} \pi_{s(0)}\,f_{X|S}\big(x(0)|s(0)\big)\,a_{s(0)s(1)}\,f_{X|S}\big(x(1)|s(1)\big) \cdots a_{s(T-2)s(T-1)}\,f_{X|S}\big(x(T-1)|s(T-1)\big)
\end{aligned} \qquad (5.14)$$

where the summation is taken over all state sequences s. In the training 
process, the transition probabilities and the parameters of the observation 
pdfs are estimated to maximise the model likelihood of Equation (5.14). 
Direct maximisation of Equation (5.14) with respect to the model 
parameters is a non-trivial task. Furthermore, for an observation sequence of 
length T vectors, the computational load of Equation (5.14) is $O(N^T)$. This is 
an impractically large load, even for such modest values as N=6 and T=30. 
However, the repetitive structure of the trellis state-time diagram of an 
HMM implies that there is a large amount of repeated computation in 
Equation (5.14) that can be avoided in an efficient implementation. In the 
next section we consider the forward-backward method of model likelihood 
calculation, and then proceed to describe an iterative maximum-likelihood 
model optimisation method. 



5.3.1 Forward-Backward Probability Computation 

An efficient recursive algorithm for the computation of the likelihood 
function $f_{X|\mathcal{M}}(X|\mathcal{M})$ is the forward-backward algorithm. The 
forward-backward computation method exploits the highly regular and repetitive 
structure of the state-time trellis diagram of Figure 5.8. 

In this method, a forward probability variable $\alpha_t(i)$ is defined as the 
joint probability of the partial observation sequence X=[x(0), x(1), ..., x(t)] 
and the state i at time t, of the model $\mathcal{M}$: 

$$\alpha_t(i) = f_{X,S|\mathcal{M}}\big(x(0), x(1), \ldots, x(t),\ s(t)=i \,\big|\, \mathcal{M}\big) \qquad (5.15)$$

The forward probability variable $\alpha_t(i)$ of Equation (5.15) can be expressed 
in a recursive form in terms of the forward probabilities at time t-1, 
$\alpha_{t-1}(j)$: 

$$\begin{aligned}
\alpha_t(i) &= f_{X,S|\mathcal{M}}\big(x(0), x(1), \ldots, x(t),\ s(t)=i \,\big|\, \mathcal{M}\big) \\
&= \left( \sum_{j=1}^{N} f_{X,S|\mathcal{M}}\big(x(0), \ldots, x(t-1),\ s(t-1)=j \,\big|\, \mathcal{M}\big)\,a_{ji} \right) f_{X|S,\mathcal{M}}\big(x(t) \,\big|\, s(t)=i, \mathcal{M}\big) \\
&= \sum_{j=1}^{N} \big(\alpha_{t-1}(j)\,a_{ji}\big)\,f_{X|S,\mathcal{M}}\big(x(t) \,\big|\, s(t)=i, \mathcal{M}\big)
\end{aligned} \qquad (5.16)$$

Figure 5.9 A network for computation of forward probabilities for a left-right HMM. 

Figure 5.9 illustrates a network for computation of the forward probabilities 
for the 4-state left-right HMM of Figure 5.8. The likelihood of an 
observation sequence X=[x(0), x(1), ..., x(T-1)] given a model $\mathcal{M}$ can be 
expressed in terms of the forward probabilities as 

$$\begin{aligned}
f_{X|\mathcal{M}}\big(x(0), x(1), \ldots, x(T-1) \,\big|\, \mathcal{M}\big) &= \sum_{i=1}^{N} f_{X,S|\mathcal{M}}\big(x(0), x(1), \ldots, x(T-1),\ s(T-1)=i \,\big|\, \mathcal{M}\big) \\
&= \sum_{i=1}^{N} \alpha_{T-1}(i)
\end{aligned} \qquad (5.17)$$
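
The saving offered by the forward recursion is easiest to see by comparing 
it with a direct evaluation of the sum over all state sequences in Equation 
(5.14). The sketch below does both for a small discrete-observation HMM. It 
is an illustration only: the function names, the zero-based indexing and the 
representation of the state observation pmf as a table B[i][k] (probability 
of symbol k in state i) are assumptions. 

```python
import itertools


def forward(pi, A, B, obs):
    """Forward variables alpha[t][i] of Equations (5.15)-(5.16)."""
    N = len(pi)
    # Initialisation at t = 0: alpha_0(i) = pi_i * f(x(0) | s(0) = i)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]
    for x in obs[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[j] * A[j][i] for j in range(N)) * B[i][x]
                      for i in range(N)])
    return alpha


def forward_likelihood(pi, A, B, obs):
    """Likelihood of Equation (5.17): sum of the final forward variables."""
    return sum(forward(pi, A, B, obs)[-1])


def brute_force_likelihood(pi, A, B, obs):
    """Direct evaluation of Equation (5.14) over all N**T state sequences."""
    total = 0.0
    for seq in itertools.product(range(len(pi)), repeat=len(obs)):
        prob = pi[seq[0]] * B[seq[0]][obs[0]]
        for t in range(1, len(obs)):
            prob *= A[seq[t - 1]][seq[t]] * B[seq[t]][obs[t]]
        total += prob
    return total
```

The recursion needs on the order of N^2 T multiplications, whereas the 
brute-force sum visits N^T sequences. For long sequences the forward 
variables become very small, so practical implementations usually rescale 
them at each step or work with logarithms; that refinement is omitted here. 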

Similar to the definition of the forward probability concept, a backward 
probability is defined as the probability of the partial observation 
sequence [x(t+1), x(t+2), ..., x(T-1)], given the state i at time t, as 

$$\begin{aligned}
\beta_t(i) &= f_{X|S,\mathcal{M}}\big(x(t+1), x(t+2), \ldots, x(T-1) \,\big|\, s(t)=i, \mathcal{M}\big) \\
&= \sum_{j=1}^{N} a_{ij}\,f_{X|S,\mathcal{M}}\big(x(t+2), x(t+3), \ldots, x(T-1) \,\big|\, s(t+1)=j, \mathcal{M}\big)\,f_{X|S,\mathcal{M}}\big(x(t+1) \,\big|\, s(t+1)=j, \mathcal{M}\big) \\
&= \sum_{j=1}^{N} a_{ij}\,\beta_{t+1}(j)\,f_{X|S,\mathcal{M}}\big(x(t+1) \,\big|\, s(t+1)=j, \mathcal{M}\big)
\end{aligned} \qquad (5.18)$$
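
The backward recursion runs from the end of the sequence towards its start. 
The sketch below implements it for a discrete-observation HMM, using the 
common convention that the backward variable at the final time instant 
equals one for every state; that initialisation, the function names and the 
table representation B[i][k] of the observation pmf are assumptions. A 
forward pass is included only so that the two can be cross-checked. 

```python
def forward(pi, A, B, obs):
    """Forward variables alpha[t][i] (see Equation (5.16))."""
    N = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]
    for x in obs[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[j] * A[j][i] for j in range(N)) * B[i][x]
                      for i in range(N)])
    return alpha


def backward(A, B, obs):
    """Backward variables beta[t][i] of Equation (5.18)."""
    N, T = len(A), len(obs)
    # Assumed initialisation: beta_{T-1}(i) = 1 for all states i.
    beta = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(A[i][j] * beta[t + 1][j] * B[j][obs[t + 1]]
                       for j in range(N)) for i in range(N)]
    return beta


def likelihood_at_time(alpha, beta, t):
    """Sum over states of alpha_t(i) * beta_t(i); the same for every t."""
    return sum(a * b for a, b in zip(alpha[t], beta[t]))
```

A useful consistency check is that the sum over states of the product of 
the forward and backward variables gives the same sequence likelihood at 
every time instant. 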



In the next section, forward and backward probabilities are used to develop 
a method for the training of HMM parameters. 



5.3.2 Baum-Welch Model Re-Estimation 



The HMM training problem is the estimation of the model parameters 
$\mathcal{M} = (\pi, A, F)$ for a given data set. These parameters are the initial state 
probabilities $\pi$, the state transition probability matrix A and the continuous 
(or discrete) density state observation pdfs. The HMM parameters are 
estimated from a set of training examples {X=[x(0), ..., x(T-1)]}, with the 
objective of maximising $f_{X|\mathcal{M}}(X|\mathcal{M})$, the likelihood of the model and the 
training data. The Baum-Welch method of training HMMs is an iterative 
likelihood maximisation method based on the forward-backward 
probabilities defined in the preceding section. The Baum-Welch method is 
an instance of the EM algorithm described in Chapter 4. For an HMM 
$\mathcal{M}$, the posterior probability of a transition at time t from state i to state j of the 
model, given an observation sequence X, can be expressed as 



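$$\begin{aligned}
\gamma_t(i,j) &= P_{S|X,\mathcal{M}}\big(s(t)=i,\ s(t+1)=j \,\big|\, X, \mathcal{M}\big) \\
&= \frac{f_{S,X|\mathcal{M}}\big(s(t)=i,\ s(t+1)=j,\ X \,\big|\, \mathcal{M}\big)}{f_{X|\mathcal{M}}(X|\mathcal{M})} \\
&= \frac{\alpha_t(i)\,a_{ij}\,f_{X|S,\mathcal{M}}\big(x(t+1) \,\big|\, s(t+1)=j, \mathcal{M}\big)\,\beta_{t+1}(j)}{\sum_{i=1}^{N} \alpha_{T-1}(i)}
\end{aligned} \qquad (5.19)$$

where $f_{S,X|\mathcal{M}}(s(t)=i,\ s(t+1)=j,\ X \,|\, \mathcal{M})$ is the joint pdf of the states s(t) and 
s(t+1) and the observation sequence X, and $f_{X|S}(x(t+1) \,|\, s(t+1)=j)$ is the 
state observation pdf for the state j. Note that for a discrete observation 
density HMM the state observation pdf in Equation (5.19) is replaced with 
the discrete state observation pmf $P_{X|S}(x(t+1) \,|\, s(t+1)=j)$. The posterior 
probability of state i at time t given the model $\mathcal{M}$ and the observation X is 

$$\begin{aligned}
\gamma_t(i) &= P_{S|X,\mathcal{M}}\big(s(t)=i \,\big|\, X, \mathcal{M}\big) \\
&= \frac{f_{S,X|\mathcal{M}}\big(s(t)=i,\ X \,\big|\, \mathcal{M}\big)}{f_{X|\mathcal{M}}(X|\mathcal{M})} \\
&= \frac{\alpha_t(i)\,\beta_t(i)}{\sum_{j=1}^{N} \alpha_{T-1}(j)}
\end{aligned} \qquad (5.20)$$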



Now the state transition probability $a_{ij}$ can be interpreted as 

$$a_{ij} = \frac{\text{expected number of transitions from state } i \text{ to state } j}{\text{expected number of transitions from state } i} \qquad (5.21)$$

From Equations (5.19)-(5.21), the state transition probability can be 
re-estimated as the ratio 

$$\bar{a}_{ij} = \frac{\sum_{t=0}^{T-2} \gamma_t(i,j)}{\sum_{t=0}^{T-2} \gamma_t(i)} \qquad (5.22)$$

Note that for an observation sequence [x(0), ..., x(T-1)] of length T, the last 
transition occurs at time T-2 as indicated in the upper limits of the 
summations in Equation (5.22). The initial-state probabilities are estimated 
as 

$$\bar{\pi}_i = \gamma_0(i) \qquad (5.23)$$
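
One pass of the re-estimation formulas (5.22) and (5.23) can be sketched as 
follows for a single training sequence and a discrete-observation HMM. This 
is a minimal illustration under stated assumptions, not a complete training 
routine: the observation table B[i][k] is held fixed, no scaling is applied 
to the forward and backward variables, the backward variables are 
initialised to one at the final time instant, and all function names are 
hypothetical. The sequence must contain at least two observations. 

```python
def forward(pi, A, B, obs):
    """Forward variables alpha[t][i]."""
    N = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]
    for x in obs[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[j] * A[j][i] for j in range(N)) * B[i][x]
                      for i in range(N)])
    return alpha


def backward(A, B, obs):
    """Backward variables beta[t][i], with beta at the last instant set to 1."""
    N, T = len(A), len(obs)
    beta = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(A[i][j] * beta[t + 1][j] * B[j][obs[t + 1]]
                       for j in range(N)) for i in range(N)]
    return beta


def reestimate_pi_and_A(pi, A, B, obs):
    """Return (new_pi, new_A, likelihood of obs under the input parameters)."""
    N, T = len(pi), len(obs)
    alpha, beta = forward(pi, A, B, obs), backward(A, B, obs)
    likelihood = sum(alpha[-1])
    # State posteriors gamma_t(i), Equation (5.20)
    gamma = [[alpha[t][i] * beta[t][i] / likelihood for i in range(N)]
             for t in range(T)]
    new_A = []
    for i in range(N):
        denom = sum(gamma[t][i] for t in range(T - 1))
        row = []
        for j in range(N):
            # Sum over t of the transition posteriors gamma_t(i, j), Eq. (5.19)
            numer = sum(alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                        for t in range(T - 1)) / likelihood
            row.append(numer / denom)  # Equation (5.22)
        new_A.append(row)
    new_pi = gamma[0][:]  # Equation (5.23)
    return new_pi, new_A, likelihood
```

Calling the function repeatedly, each time feeding back the re-estimated 
parameters, should give a sequence of likelihoods that never decreases, 
which is the characteristic behaviour of an EM iteration. 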

5.3.3 Training HMMs with Discrete Density Observation Models 

In a discrete density HMM, the observation signal space for each state is 
modelled by a set of discrete symbols or vectors. Assume that a set of M 
vectors $[\mu_{i1}, \mu_{i2}, \ldots, \mu_{iM}]$ model the space of the signal associated with the i-th 
state. These vectors may be obtained from a clustering process as the 
centroids of the clusters of the training signals associated with each state. 
The objective in training discrete density HMMs is to compute the state 
transition probabilities and the state observation probabilities. The 
forward-backward equations for discrete density HMMs are the same as those for 
continuous density HMMs, derived in the previous sections, with the 
difference that the probability density functions such as $f_{X|S}(x(t) \,|\, s(t)=i)$ 
are substituted with probability mass functions $P_{X|S}(x(t) \,|\, s(t)=i)$ defined 
as 

$$P_{X|S}\big(x(t) \,\big|\, s(t)=i\big) = P_{X|S}\big(Q[x(t)] \,\big|\, s(t)=i\big) \qquad (5.24)$$

where the function Q[x(t)] quantises the observation vector x(t) to the 
nearest discrete vector in the set $[\mu_{i1}, \mu_{i2}, \ldots, \mu_{iM}]$. For discrete density 
HMMs, the probability of a state vector $\mu_{ik}$ can be defined as the ratio of the 
number of occurrences of $\mu_{ik}$ (or vectors quantised to $\mu_{ik}$) in the state i, 
divided by the total number of occurrences of all vectors in the state i: 

$$\begin{aligned}
\bar{P}_{ik}(\mu_{ik}) &= \frac{\text{expected number of times in state } i \text{ and observing } \mu_{ik}}{\text{expected number of times in state } i} \\
&= \frac{\sum_{t \in x(t) \to \mu_{ik}} \gamma_t(i)}{\sum_{t=0}^{T-1} \gamma_t(i)}
\end{aligned} \qquad (5.25)$$

In Equation (5.25) the summation in the numerator is taken over those time 
instants t where the k-th symbol is observed in the state i. 

For statistically reliable results, an HMM must be trained on a large 
data set X consisting of a sufficient number of independent realisations of 
the process to be modelled. Assume that the training data set consists of L 
realisations X=[X(0), X(1), ..., X(L-1)], where $X(k) = [x(0), x(1), \ldots, x(T_k - 1)]$. 
The re-estimation formula can be averaged over the entire data set as 

$$\bar{\pi}_i = \frac{1}{L} \sum_{l=0}^{L-1} \gamma_0^{\,l}(i) \qquad (5.26)$$

$$\bar{a}_{ij} = \frac{\sum_{l=0}^{L-1} \sum_{t=0}^{T_l-2} \gamma_t^{\,l}(i,j)}{\sum_{l=0}^{L-1} \sum_{t=0}^{T_l-2} \gamma_t^{\,l}(i)} \qquad (5.27)$$

and 

$$\bar{P}_i(\mu_{ik}) = \frac{\sum_{l=0}^{L-1} \sum_{t \in x(t) \to \mu_{ik}} \gamma_t^{\,l}(i)}{\sum_{l=0}^{L-1} \sum_{t=0}^{T_l-1} \gamma_t^{\,l}(i)} \qquad (5.28)$$



The parameter estimates of Equations (5.26)-(5.28) can be used in further 
iterations of the estimation process until the model converges. 



5.3.4 HMMs with Continuous Density Observation Models 



In continuous density HMMs, continuous probability density functions 
(pdfs) are used to model the space of the observation signals associated with 
each state. Baum et al. generalised the parameter re-estimation method to 
HMMs with log-concave continuous pdfs such as a Gaussian pdf. A continuous 
P-variate Gaussian pdf for the state i of an HMM can be defined as 

$$f_{X|S}\big(x(t) \,\big|\, s(t)=i\big) = \frac{1}{(2\pi)^{P/2}\,|\Sigma_i|^{1/2}} \exp\left\{ -\frac{1}{2}\,[x(t)-\mu_i]^{\mathrm{T}}\,\Sigma_i^{-1}\,[x(t)-\mu_i] \right\} \qquad (5.29)$$

where $\mu_i$ and $\Sigma_i$ are the mean vector and the covariance matrix associated 
with the state i. The re-estimation formula for the mean vector of the state 
Gaussian pdf can be derived as 

$$\bar{\mu}_i = \frac{\sum_{t=0}^{T-1} \gamma_t(i)\,x(t)}{\sum_{t=0}^{T-1} \gamma_t(i)} \qquad (5.30)$$

Similarly, the covariance matrix is estimated as 

$$\bar{\Sigma}_i = \frac{\sum_{t=0}^{T-1} \gamma_t(i)\,\big(x(t)-\bar{\mu}_i\big)\big(x(t)-\bar{\mu}_i\big)^{\mathrm{T}}}{\sum_{t=0}^{T-1} \gamma_t(i)} \qquad (5.31)$$

The proof that the Baum-Welch re-estimation algorithm leads to 
maximisation of the likelihood function $f_{X|\mathcal{M}}(X|\mathcal{M})$ can be found in Baum. 
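
Equations (5.30) and (5.31) are weighted sample statistics, with the state 
posteriors acting as the weights. The sketch below shows this for 
scalar observations, in which case the covariance matrix reduces to a 
variance. The scalar restriction, the function name and the assumption that 
the state posteriors have already been computed by a forward-backward pass 
are simplifications made for the illustration. 

```python
def reestimate_gaussian(gamma_i, xs):
    """Re-estimate the mean and variance of one state's Gaussian pdf.

    gamma_i[t] is the posterior probability of the state at time t and
    xs[t] is the (scalar) observation at time t.
    """
    total = sum(gamma_i)
    # Equation (5.30): posterior-weighted mean
    mean = sum(g * x for g, x in zip(gamma_i, xs)) / total
    # Equation (5.31), scalar case: posterior-weighted variance about the new mean
    var = sum(g * (x - mean) ** 2 for g, x in zip(gamma_i, xs)) / total
    return mean, var


# With all posteriors equal to one the result is the ordinary sample mean
# and the (biased) sample variance.
mean_all, var_all = reestimate_gaussian([1.0, 1.0, 1.0, 1.0], [1.0, 2.0, 3.0, 4.0])
```

Observations that are unlikely to belong to the state receive small weights 
and so contribute little to its re-estimated parameters. 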

5.3.5 HMMs with Mixture Gaussian pdfs 

The modelling of the space of a signal process with a mixture of Gaussian 
pdfs is considered in Section 4.5. In HMMs with mixture Gaussian pdf state 
models, the signal space associated with the i-th state is modelled with a 
mixture of M Gaussian densities as 

$$f_{X|S}\big(x(t) \,\big|\, s(t)=i\big) = \sum_{k=1}^{M} P_{ik}\,\mathcal{N}\big(x(t);\,\mu_{ik}, \Sigma_{ik}\big) \qquad (5.32)$$

where $P_{ik}$ is the prior probability of the k-th component of the mixture. The 
posterior probability of state i at time t and state j at time t+1 of the model 
$\mathcal{M}$, given an observation sequence X=[x(0), ..., x(T-1)], can be expressed as 

$$\begin{aligned}
\gamma_t(i,j) &= P_{S|X,\mathcal{M}}\big(s(t)=i,\ s(t+1)=j \,\big|\, X, \mathcal{M}\big) \\
&= \frac{\alpha_t(i)\,a_{ij} \left[ \sum_{k=1}^{M} P_{jk}\,\mathcal{N}\big(x(t+1);\,\mu_{jk}, \Sigma_{jk}\big) \right] \beta_{t+1}(j)}{\sum_{i=1}^{N} \alpha_{T-1}(i)}
\end{aligned} \qquad (5.33)$$

and the posterior probability of state i at time t given the model $\mathcal{M}$ and the 
observation X is given by 

$$\gamma_t(i) = P_{S|X,\mathcal{M}}\big(s(t)=i \,\big|\, X, \mathcal{M}\big) = \frac{\alpha_t(i)\,\beta_t(i)}{\sum_{j=1}^{N} \alpha_{T-1}(j)} \qquad (5.34)$$

Now we define the joint posterior probability of the state i and the k-th 
Gaussian mixture component pdf model of the state i at time t as 

$$\begin{aligned}
\xi_t(i,k) &= P_{S,K|X,\mathcal{M}}\big(s(t)=i,\ m(t)=k \,\big|\, X, \mathcal{M}\big) \\
&= \frac{\sum_{j=1}^{N} \alpha_{t-1}(j)\,a_{ji}\,P_{ik}\,\mathcal{N}\big(x(t);\,\mu_{ik}, \Sigma_{ik}\big)\,\beta_t(i)}{\sum_{j=1}^{N} \alpha_{T-1}(j)}
\end{aligned} \qquad (5.35)$$

where m(t) is the Gaussian mixture component at time t. Equations (5.33) to 
(5.35) are used to derive the re-estimation formula for the mixture 
coefficients, the mean vectors and the covariance matrices of the state 
mixture Gaussian pdfs as 

$$\bar{P}_{ik} = \frac{\sum_{t=0}^{T-1} \xi_t(i,k)}{\sum_{t=0}^{T-1} \gamma_t(i)} \qquad (5.36)$$

$$\bar{\mu}_{ik} = \frac{\sum_{t=0}^{T-1} \xi_t(i,k)\,x(t)}{\sum_{t=0}^{T-1} \xi_t(i,k)} \qquad (5.37)$$

Similarly the covariance matrix is estimated as 

$$\bar{\Sigma}_{ik} = \frac{\sum_{t=0}^{T-1} \xi_t(i,k)\,\big[x(t)-\bar{\mu}_{ik}\big]\big[x(t)-\bar{\mu}_{ik}\big]^{\mathrm{T}}}{\sum_{t=0}^{T-1} \xi_t(i,k)} \qquad (5.38)$$

5.4 Decoding of Signals Using Hidden Markov Models 

Hidden Markov models are used in applications such as speech recognition, 
image recognition and signal restoration, and for the decoding of the 
underlying states of a signal. For example, in speech recognition, HMMs are 
trained to model the statistical variations of the acoustic realisations of the 
words in a vocabulary of say size V words. In the word recognition phase, 
an utterance is classified and labelled with the most likely of the V+l 
candidate HMMs (including an HMM for silence) as illustrated in Figure 
5.10. In Chapter 12 on the modelling and detection of impulsive noise, a 
binary-state HMM is used to model the impulsive noise process. 

Consider the decoding of an unlabelled sequence of T signal vectors 
X=[x(0), x(1), ..., x(T-1)] given a set of V candidate HMMs $[\mathcal{M}_1, \ldots, \mathcal{M}_V]$. 
The probability score for the observation vector sequence X and the model 
$\mathcal{M}_k$ can be calculated as the likelihood: 

$$f_{X|\mathcal{M}}(X|\mathcal{M}_k) = \sum_{s} \pi_{s(0)}\,f_{X|S}\big(x(0)|s(0)\big)\,a_{s(0)s(1)}\,f_{X|S}\big(x(1)|s(1)\big) \cdots a_{s(T-2)s(T-1)}\,f_{X|S}\big(x(T-1)|s(T-1)\big) \qquad (5.39)$$

where the likelihood of the observation sequence X is summed over all 
possible state sequences of the model $\mathcal{M}_k$. Equation (5.39) can be efficiently 
calculated using the forward-backward method described in Section 5.3.1. 
The observation sequence X is labelled with the HMM that scores the 
highest likelihood as 

$$\mathrm{Label}(X) = \arg\max_{k}\,\big( f_{X|\mathcal{M}}(X|\mathcal{M}_k) \big), \qquad k = 1, \ldots, V+1 \qquad (5.40)$$

In decoding applications often the likelihood of an observation sequence X 
and a model $\mathcal{M}_k$ is obtained along the single most likely state sequence of 
model $\mathcal{M}_k$, instead of being summed over all sequences, so Equation (5.40) 
becomes 

$$\mathrm{Label}(X) = \arg\max_{k} \left( \max_{s}\,f_{X,S|\mathcal{M}}(X, s \,|\, \mathcal{M}_k) \right) \qquad (5.41)$$

Figure 5.10 Illustration of the use of HMMs in speech recognition. 



In Section 5.5, on the use of HMMs for noise reduction, the most likely state 
sequence is used to obtain the maximum-likelihood estimate of the 
underlying statistics of the signal process. 

Figure 5.11 A network illustration of the Viterbi algorithm. 



5.4.1 Viterbi Decoding Algorithm 

In this section, we consider the decoding of a signal to obtain the maximum 
a posteriori (MAP) estimate of the underlying state sequence. The MAP state 
sequence $s^{MAP}$ of a model $\mathcal{M}$ given an observation signal sequence 
X=[x(0), ..., x(T-1)] is obtained as 

$$\begin{aligned}
s^{MAP} &= \arg\max_{s}\,f_{X,S|\mathcal{M}}(X, s \,|\, \mathcal{M}) \\
&= \arg\max_{s}\,\big( f_{X|S,\mathcal{M}}(X \,|\, s, \mathcal{M})\,P_{S|\mathcal{M}}(s \,|\, \mathcal{M}) \big)
\end{aligned} \qquad (5.42)$$



The MAP state sequence estimate is used in such applications as the 
calculation of a similarity score between a signal sequence X and an HMM 
M, segmentation of a non-stationary signal into a number of distinct quasi- 
stationary segments, and implementation of state-based Wiener filters for 
restoration of noisy signals as described in the next section. 

For an N-state HMM and an observation sequence of length T, there are 
altogether $N^T$ state sequences. Even for moderate values of N and T (say 
N=6 and T=30), an exhaustive search of the state-time trellis for the best 
state sequence is a computationally prohibitive exercise. The Viterbi 
algorithm is an efficient method for the estimation of the most likely state 
sequence of an HMM. In a state-time trellis diagram, such as Figure 5.8, the 
number of paths diverging from each state of a trellis can grow 
exponentially by a factor of N at successive time instants. The Viterbi 
method prunes the trellis by selecting the most likely path to each state. At 
each time instant t, for each state i, the algorithm selects the most probable 
path to state i and prunes out the less likely branches. This procedure 
ensures that at any time instant, only a single path survives into each state of 
the trellis. 

For each time instant t and for each state i, the algorithm keeps a record 
of the state j from which the maximum-likelihood path branched into i, and 
also records the cumulative probability of the most likely path into state i at 
time t. The Viterbi algorithm is given below, and Figure 5.11 
gives a network illustration of the algorithm. 

Viterbi Algorithm 

$\delta_t(i)$ records the cumulative probability of the best path to state i at time t. 
$\psi_t(i)$ records the best state sequence to state i at time t. 

Step 1: Initialisation, at time t=0, for states i=1, ..., N 
    $\delta_0(i) = \pi_i\,f_i(x(0))$ 
    $\psi_0(i) = 0$ 

Step 2: Recursive calculation of the ML state sequences and their 
probabilities 
    For time t=1, ..., T-1 
        For states i=1, ..., N 
            $\delta_t(i) = \max_{j}\,[\delta_{t-1}(j)\,a_{ji}]\,f_i(x(t))$ 
            $\psi_t(i) = \arg\max_{j}\,[\delta_{t-1}(j)\,a_{ji}]$ 

Step 3: Termination, retrieve the most likely final state 
    $s^{MAP}(T-1) = \arg\max_{i}\,[\delta_{T-1}(i)]$ 
    $Prob_{max} = \max_{i}\,[\delta_{T-1}(i)]$ 

Step 4: Backtracking through the most likely state sequence: 
    For t=T-2, ..., 0 
        $s^{MAP}(t) = \psi_{t+1}[s^{MAP}(t+1)]$ 

The backtracking routine retrieves the most likely state sequence of the 
model $\mathcal{M}$. Note that the variable $Prob_{max}$, which is the probability of the 
observation sequence X=[x(0), ..., x(T-1)] and the most likely state 
sequence of the model $\mathcal{M}$, can be used as the probability score for the model 
$\mathcal{M}$ and the observation X. For example, in speech recognition, for each 

candidate word model the probability of the observation and the most likely 
state sequence is calculated, and then the observation is labelled with the 
word that achieves the highest probability score. 
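
The four steps of the Viterbi algorithm translate almost line for line into 
code. The sketch below is an illustration for a discrete-observation HMM 
under stated assumptions: states are indexed from zero, the state 
observation model is a table B[i][k] giving the probability of symbol k in 
state i, plain probabilities are used rather than logarithms, and the 
function names are hypothetical. 

```python
def viterbi(pi, A, B, obs):
    """Return (most likely state sequence, its joint probability with obs)."""
    N, T = len(pi), len(obs)
    # Step 1: initialisation
    delta = [[pi[i] * B[i][obs[0]] for i in range(N)]]
    psi = [[0] * N]
    # Step 2: recursion over the trellis, keeping one survivor path per state
    for t in range(1, T):
        delta_t, psi_t = [], []
        for i in range(N):
            best_j = max(range(N), key=lambda j: delta[t - 1][j] * A[j][i])
            psi_t.append(best_j)
            delta_t.append(delta[t - 1][best_j] * A[best_j][i] * B[i][obs[t]])
        delta.append(delta_t)
        psi.append(psi_t)
    # Step 3: termination
    last = max(range(N), key=lambda i: delta[T - 1][i])
    prob_max = delta[T - 1][last]
    # Step 4: backtracking
    path = [last]
    for t in range(T - 2, -1, -1):
        path.append(psi[t + 1][path[-1]])
    path.reverse()
    return path, prob_max


def joint_probability(pi, A, B, obs, path):
    """Joint probability of an observation sequence and one state sequence."""
    prob = pi[path[0]] * B[path[0]][obs[0]]
    for t in range(1, len(obs)):
        prob = prob * A[path[t - 1]][path[t]] * B[path[t]][obs[t]]
    return prob
```

For long sequences the products underflow, so a practical implementation 
would normally replace the products by sums of log probabilities; the 
maximisation and backtracking logic is unchanged by that substitution. 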



5.5 HMM-Based Estimation of Signals in Noise 

In this section, and the following two sections, we consider the use of 
HMMs for estimation of a signal x(t) observed in an additive noise n(t), and 
modelled as 

$$y(t) = x(t) + n(t) \qquad (5.43)$$

From Bayes’ rule, the posterior pdf of the signal x(t) given the noisy 
observation y(t) is defined as 

$$\begin{aligned}
f_{X|Y}\big(x(t) \,\big|\, y(t)\big) &= \frac{f_{Y|X}\big(y(t) \,\big|\, x(t)\big)\,f_X\big(x(t)\big)}{f_Y\big(y(t)\big)} \\
&= \frac{1}{f_Y\big(y(t)\big)}\,f_N\big(y(t)-x(t)\big)\,f_X\big(x(t)\big)
\end{aligned} \qquad (5.44)$$

For a given observation, $f_Y(y(t))$ is a constant, and the maximum a posteriori 
(MAP) estimate is obtained as 

$$\hat{x}^{MAP}(t) = \arg\max_{x(t)}\,f_N\big(y(t)-x(t)\big)\,f_X\big(x(t)\big) \qquad (5.45)$$
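
As a simple worked case of Equation (5.45), suppose that both the signal 
and the noise are scalar, stationary and Gaussian. This special case is an 
assumption introduced here for illustration and is not the HMM-based 
estimator developed in the following sections. The product of the two 
Gaussian pdfs is then maximised by a weighted combination of the prior mean 
and the observation, which the sketch below compares with a brute-force 
search over a grid of candidate values. All names and parameter values are 
hypothetical. 

```python
import math


def gaussian_pdf(x, mean, var):
    """Scalar Gaussian density."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def map_estimate_gaussian(y, mean_x, var_x, var_n):
    """Closed-form MAP estimate for a Gaussian signal in zero-mean Gaussian noise."""
    gain = var_x / (var_x + var_n)
    return mean_x + gain * (y - mean_x)


def map_estimate_grid(y, mean_x, var_x, var_n, lo=-10.0, hi=10.0, steps=20001):
    """Maximise f_N(y - x) * f_X(x) of Equation (5.45) over a grid of x values."""
    best_x, best_val = lo, -1.0
    for k in range(steps):
        x = lo + (hi - lo) * k / (steps - 1)
        val = gaussian_pdf(y - x, 0.0, var_n) * gaussian_pdf(x, mean_x, var_x)
        if val > best_val:
            best_x, best_val = x, val
    return best_x
```

When the noise variance is small the estimate stays close to the 
observation, and when it is large the estimate is pulled towards the prior 
mean of the signal. 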

The computation of the posterior pdf, Equation (5.44), or the MAP estimate, 
Equation (5.45), requires the pdf models of the signal and the noise 
processes. Stationary, continuous-valued processes are often modelled by a 
Gaussian or a mixture Gaussian pdf that is equivalent to a single-state 
HMM. For a non-stationary process an N-state HMM can model the 
time-varying pdf of the process as a Markovian chain of N stationary Gaussian 
subprocesses. Now assume that we have an $N_s$-state HMM $\mathcal{M}$ for the signal, 
and another $N_n$-state HMM $\eta$ for the noise. For signal estimation, we need 
estimates of the underlying state sequences of the signal and the noise 
processes. For an observation sequence of length T, there are $N_s^T$ possible 
signal state sequences and $N_n^T$ possible noise state sequences that could 
have generated the noisy signal. Since it is assumed that the signal and noise 
are uncorrelated, each signal state may be observed in any noise state; 
therefore the number of noisy signal states is on the order of $N_s \times N_n$. 

Given an observation sequence F=[y(0), y( 1 ), ..., y(T’-l)], the most 
probable state sequences of the signal and the noise HMMs maybe 
expressed as 



\[ s_{signal}^{MAP} = \arg\max_{s_{signal}} \left( \max_{s_{noise}} f_Y(Y, s_{signal}, s_{noise} \mid \mathcal{M}, \eta) \right) \tag{5.46} \]

\[ s_{noise}^{MAP} = \arg\max_{s_{noise}} \left( \max_{s_{signal}} f_Y(Y, s_{signal}, s_{noise} \mid \mathcal{M}, \eta) \right) \tag{5.47} \]


Given the state sequence estimates for the signal and the noise models, the 
MAP estimation Equation (5.45) becomes 

\[ \hat{x}^{MAP}(t) = \arg\max_{x} \left( f_{N|S,\eta}(y(t)-x(t) \mid s_{noise}^{MAP}, \eta)\, f_{X|S,\mathcal{M}}(x(t) \mid s_{signal}^{MAP}, \mathcal{M}) \right) \tag{5.48} \]



Implementation of Equations (5.46)-(5.48) is computationally prohibitive. 
In Sections 5.6 and 5.7, we consider some practical methods for the 
estimation of signal in noise. 

Example Assume a signal, modelled by a binary-state HMM, is observed 
in an additive stationary Gaussian noise. Let the noisy observation be 
modelled as 

\[ y(t) = \bar{s}(t)\, x_0(t) + s(t)\, x_1(t) + n(t) \tag{5.49} \]

where s(t) is a hidden binary-state process, with $\bar{s}(t)$ denoting its 
binary complement, such that: s(t) = 0 indicates that the signal is from the 
state $S_0$ with a Gaussian pdf of 
$\mathcal{N}(x(t), \mu_{x_0}, \Sigma_{x_0 x_0})$, and s(t) = 1 indicates that 
the signal is from the state $S_1$ with a Gaussian pdf of 
$\mathcal{N}(x(t), \mu_{x_1}, \Sigma_{x_1 x_1})$. Assume that a stationary 
Gaussian process $\mathcal{N}(n(t), \mu_n, \Sigma_{nn})$, equivalent to a 
single-state HMM, can model the noise. 

Using the Viterbi algorithm the maximum a posteriori (MAP) state 
sequence of the signal model can be estimated as 



\[ s_{signal}^{MAP} = \arg\max_{s} \left[ f_{Y|S,\mathcal{M}}(Y \mid s, \mathcal{M})\, P_{S|\mathcal{M}}(s \mid \mathcal{M}) \right] \tag{5.50} \]



For a Gaussian-distributed signal and additive Gaussian noise, the 
observation pdf of the noisy signal is also Gaussian. Hence, the state 
observation pdfs of the signal model can be modified to account for the 
additive noise as 

\[ f_{Y|s_0}(y(t) \mid s_0) = \mathcal{N}\left(y(t), (\mu_{x_0}+\mu_n), (\Sigma_{x_0 x_0}+\Sigma_{nn})\right) \tag{5.51} \]

and 

\[ f_{Y|s_1}(y(t) \mid s_1) = \mathcal{N}\left(y(t), (\mu_{x_1}+\mu_n), (\Sigma_{x_1 x_1}+\Sigma_{nn})\right) \tag{5.52} \]

where $\mathcal{N}(y(t), \mu, \Sigma)$ denotes a Gaussian pdf with mean 
vector $\mu$ and covariance matrix $\Sigma$. The MAP signal estimate, given 
a state sequence estimate $s^{MAP}$, is obtained from 



\[ \hat{x}^{MAP}(t) = \arg\max_{x} \left[ f_{X|S,\mathcal{M}}(x(t) \mid s^{MAP}, \mathcal{M})\, f_N(y(t)-x(t)) \right] \tag{5.53} \]



Substitution of the Gaussian pdf of the signal from the most likely state 
sequence, and the pdf of noise, in Equation (5.53) results in the following 
MAP estimate: 



\[ \hat{x}^{MAP}(t) = (\Sigma_{xx,s(t)} + \Sigma_{nn})^{-1}\, \Sigma_{xx,s(t)}\, (y(t) - \mu_n) + (\Sigma_{xx,s(t)} + \Sigma_{nn})^{-1}\, \Sigma_{nn}\, \mu_{x,s(t)} \tag{5.54} \]

where $\mu_{x,s(t)}$ and $\Sigma_{xx,s(t)}$ are the mean vector and 
covariance matrix of the signal x(t) obtained from the most likely state 
sequence [s(t)]. 
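
As an illustration of the state-conditioned MAP estimate, the following sketch evaluates the Gaussian estimate for one decoded state. The function name `map_estimate` and its arguments are hypothetical. The weights are written here as $\Sigma_{xx}(\Sigma_{xx}+\Sigma_{nn})^{-1}$ and $\Sigma_{nn}(\Sigma_{xx}+\Sigma_{nn})^{-1}$, which is the standard Gaussian posterior mean; for diagonal (or, more generally, commuting) covariance matrices this coincides with the ordering of the factors in Equation (5.54).

```python
import numpy as np

def map_estimate(y, mu_x, S_xx, mu_n, S_nn):
    """Gaussian MAP estimate of x from y = x + n for one decoded state.

    mu_x, S_xx : mean and covariance of the signal in the decoded state
    mu_n, S_nn : mean and covariance of the noise
    (all names are illustrative assumptions, not from the text)
    """
    S_yy_inv = np.linalg.inv(S_xx + S_nn)
    weight_obs = S_xx @ S_yy_inv      # weight on the noise-compensated observation
    weight_prior = S_nn @ S_yy_inv    # weight on the state mean of the signal
    return weight_obs @ (y - mu_n) + weight_prior @ mu_x
```

When the noise covariance is small relative to the signal covariance the estimate follows the observation; when the noise dominates, it falls back on the state mean of the signal.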
Figure 5.12 Outline configuration of HMM-based noisy speech recognition and 
enhancement. 



5.6 Signal and Noise Model Combination and Decomposition 

For Bayesian estimation of a signal observed in additive noise, we need to 
have an estimate of the underlying statistical state sequences of the signal 
and the noise processes. Figure 5.12 illustrates the outline of an HMM- 
based noisy speech recognition and enhancement system. The system 
performs the following functions: 

(1) combination of the speech and noise HMMs to form the noisy 
speech HMMs; 

(2) estimation of the best combined noisy speech model given the 
current noisy speech input; 

(3) state decomposition, i.e. the separation of speech and noise states 
given noisy speech states; 

(4) state-based Wiener filtering using the estimates of speech and noise 
states. 

5.6.1 Hidden Markov Model Combination 

The performance of HMMs trained on clean signals deteriorates rapidly in 
the presence of noise, since noise causes a mismatch between the clean 
HMMs and the noisy signals. The noise-induced mismatch can be reduced: 
either by filtering the noise from the signal (for example using the Wiener 
filtering and the spectral subtraction methods described in Chapters 6 and 
11) or by combining the noise and the signal models to model the noisy 
signal. The model combination method was developed by Gales and Young. 
In this method HMMs of speech are combined with an HMM of noise to 
form HMMs of noisy speech signals. In the power-spectral domain, the 
mean vector and the covariance matrix of the noisy speech can be 
approximated by adding the mean vectors and the covariance matrices of 
speech and noise models: 

\[ \mu_y = \mu_x + g\, \mu_n \tag{5.55} \]

\[ \Sigma_{yy} = \Sigma_{xx} + g^2\, \Sigma_{nn} \tag{5.56} \]

Model combination also requires an estimate of the current signal-to-noise 
ratio for calculation of the scaling factor g in Equations (5.55) and (5.56). In 
cases such as speech recognition, where the models are trained on cepstral 
features, the model parameters are first transformed from cepstral features 
into power spectral features before using the additive linear combination 
Equations (5.55) and (5.56). Figure 5.13 illustrates the combination of a 
4-state left-right HMM of a speech signal with a 2-state ergodic HMM of 
noise. Assuming that speech and noise are independent processes, each 
speech state must be combined with every possible noise state to give the 
noisy speech model. It is assumed that the noise process only affects the 
mean vectors and the covariance matrices of the speech model; hence the 
transition probabilities of the speech model are not modified. 
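
The pairing of every speech state with every noise state can be sketched as follows. This is a minimal illustration in the power-spectral domain; the function name `combine_states`, the dictionary layout and the treatment of the covariance scaling as $g^2$ (taken to be consistent with scaling the noise mean by $g$) are assumptions made for the example.

```python
import numpy as np

def combine_states(speech_means, speech_covs, noise_means, noise_covs, g=1.0):
    """Form noisy-speech state statistics for every (speech, noise) state pair.

    Returns a dict mapping (i, j) -> (mean, covariance) of the combined state S_ij.
    g is the SNR-dependent scaling factor (assumed known here).
    """
    combined = {}
    for i, (mu_x, S_xx) in enumerate(zip(speech_means, speech_covs)):
        for j, (mu_n, S_nn) in enumerate(zip(noise_means, noise_covs)):
            mu_y = mu_x + g * mu_n            # additive combination of the means
            S_yy = S_xx + g ** 2 * S_nn       # additive combination of the covariances
            combined[(i, j)] = (mu_y, S_yy)
    return combined
```

For a 4-state speech model and a 2-state noise model this yields 8 combined states, one for each pair (i, j).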



5.6.2 Decomposition of State Sequences of Signal and Noise 

The HMM-based state decomposition problem can be stated as follows: 
given a noisy signal and the HMMs of the signal and the noise processes, 
estimate the underlying states of the signal and the noise. 

HMM state decomposition can be obtained using the following method: 

(a) Given the noisy signal and a set of combined signal and noise 
models, estimate the maximum-likelihood (ML) combined noisy 
HMM for the noisy signal. 

(b) Obtain the ML state sequence from the ML combined model. 

(c) Extract the signal and noise states from the ML state sequence of the 
ML combined noisy signal model. 

The ML state sequences provide the probability density functions for the 
signal and noise processes. The ML estimates of the speech and noise pdfs 
may then be used in Equation (5.45) to obtain a MAP estimate of the speech 
signal. Alternatively the mean spectral vectors of the speech and noise from 
the ML state sequences can be used to program a state-dependent Wiener 
filter as described in the next section. 

Figure 5.13 Outline configuration of HMM-based noisy speech recognition and 
enhancement. $S_{ij}$ is a combination of the state i of speech with the 
state j of noise. 
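
Step (c) of the decomposition can be illustrated with a short sketch. It assumes, purely for illustration, that the combined state $S_{ij}$ is stored under the single index $k = i N_n + j$, where $N_n$ is the number of noise states; this indexing convention and the function name `decompose_states` are not from the text.

```python
import numpy as np

def decompose_states(combined_seq, n_noise_states):
    """Split a combined-state sequence into speech and noise state sequences.

    Assumes combined index k = i * n_noise_states + j for speech state i
    and noise state j (an illustrative convention).
    """
    combined_seq = np.asarray(combined_seq)
    speech_seq, noise_seq = np.divmod(combined_seq, n_noise_states)
    return speech_seq, noise_seq
```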



5.7 HMM-Based Wiener Filters 



The least mean square error Wiener filter is derived in Chapter 6. For a 
stationary signal x(m), observed in an additive noise n(m), the Wiener filter 
equations in the time and the frequency domains are derived as: 

\[ w = (R_{xx} + R_{nn})^{-1}\, r_{xx}, \qquad W(f) = \frac{P_{XX}(f)}{P_{XX}(f) + P_{NN}(f)} \tag{5.56} \]

where $R_{xx}$, $r_{xx}$ and $P_{XX}(f)$ denote the autocorrelation matrix, 
the autocorrelation vector and the power-spectral functions respectively. 
The implementation of the Wiener filter, Equation (5.56), requires the 
signal and the noise power spectra. The power-spectral variables may be 
obtained from the ML states of the HMMs trained to model the power spectra 
of the signal and the noise. Figure 5.14 illustrates an implementation of 
HMM-based state-dependent Wiener filters. 

Figure 5.14 Illustrations of HMMs with state-dependent Wiener filters. 
[Block diagram: the signal HMM and the noise HMM feed a model combination 
block; the noisy signal is passed to ML model estimation and state 
decomposition, which supplies $P_{XX}(f)$ and $P_{NN}(f)$ to the Wiener 
filter sequence $W(f) = P_{XX}(f) / (P_{XX}(f) + P_{NN}(f))$.] 

To implement the state-dependent Wiener filter, we need an estimate of the 
state sequences for the signal and the noise. In practice, for signals such 
as speech there are a number of HMMs; one HMM per word, phoneme, or any 
other elementary unit of the signal. In such cases it is necessary to 
classify the signal, so that the state-based Wiener filters are derived 
from the most likely HMM. Furthermore the noise process can also be 
modelled by an HMM. Assuming that there are V HMMs 
$\{\mathcal{M}_1, ..., \mathcal{M}_V\}$ for the signal process, and one HMM 
for the noise, the state-based Wiener filter can be implemented as follows: 

Step 1: Combine the signal and noise models to form the noisy signal 
models. 

Step 2: Given the noisy signal, and the set of combined noisy signal 
models, obtain the ML combined noisy signal model. 

Step 3: From the ML combined model, obtain the ML state sequence of 
speech and noise. 

Step 4: Use the ML estimate of the power spectra of the signal and the 
noise to program the Wiener filter Equation (5.56). 

Step 5: Use the state-dependent Wiener filters to filter the signal. 
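
Steps 4 and 5 can be sketched as below, assuming that the ML state sequences have already been decoded and that each state stores a power spectrum. The function name `state_wiener_filter`, the array shapes and the frame-by-frame spectral processing are assumptions made for the illustration.

```python
import numpy as np

def state_wiener_filter(Y_frames, speech_seq, noise_seq, P_xx, P_nn):
    """Apply a state-dependent Wiener filter to a sequence of spectral frames.

    Y_frames   : (T, F) noisy spectra, one row per frame
    speech_seq : (T,) decoded speech state index for each frame
    noise_seq  : (T,) decoded noise state index for each frame
    P_xx       : (N_s, F) power spectrum of each speech state
    P_nn       : (N_n, F) power spectrum of each noise state
    Returns the filtered frames and the filter sequence W.
    """
    Pxx = P_xx[np.asarray(speech_seq)]     # (T, F) speech power per frame
    Pnn = P_nn[np.asarray(noise_seq)]      # (T, F) noise power per frame
    W = Pxx / (Pxx + Pnn)                  # W(f) = Pxx(f) / (Pxx(f) + Pnn(f))
    return W * Y_frames, W
```

The sketch assumes that $P_{XX}(f) + P_{NN}(f)$ is strictly positive in every frequency bin; a practical implementation would need a floor on the denominator.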



5.7.1 Modelling Noise Characteristics 

The implicit assumption in using an HMM for noise is that noise statistics 
can be modelled by a Markovian chain of N different stationary processes. 
A stationary noise process can be modelled by a single-state HMM. For a 
non-stationary noise, a multi-state HMM can model the time variations of 
the noise process with a finite number of quasi-stationary states. In general, 
the number of states required to accurately model the noise depends on the 
non-stationary character of the noise. 

An example of a non-stationary noise process is the impulsive noise of 
Figure 5.15. Figure 5.16 shows a two-state HMM of the impulsive noise 
sequence where the state $S_0$ models the "off" periods between the 
impulses and the state $S_1$ models an impulse. In cases where each impulse 
has a well-defined temporal structure, it may be beneficial to use a 
multistate HMM to model the pulse itself. HMMs are used in Chapter 12 for 
modelling impulsive noise, and in Chapter 15 for channel equalisation. 



5.8 Summary 

HMMs provide a powerful method for the modelling of non-stationary 
processes such as speech, noise and time-varying channels. An HMM is a 
Bayesian finite-state process, with a Markovian state prior, and a state 
likelihood function that can be either a discrete density model or a 
continuous Gaussian pdf model. The Markovian prior models the time 
evolution of a non-stationary process with a chain of stationary sub- 
processes. The state observation likelihood models the space of the process 
within each state of the HMM. 
Figure 5.15 Impulsive noise. 




In Section 5.3, we studied the Baum-Welch method for the training of 
the parameters of an HMM to model a given data set, and derived the 
forward-backward method for efficient calculation of the likelihood of an 
HMM given an observation signal. In Section 5.4, we considered the use of 
HMMs in signal classification and in the decoding of the underlying state 
sequence of a signal. The Viterbi algorithm is a computationally efficient 
method for estimation of the most likely sequence of an HMM. Given an 
unlabelled observation signal, the decoding of the underlying state sequence 
and the labelling of the observation with one of a number of candidate HMMs 
are accomplished using the Viterbi method. In Section 5.5, we considered 
the use of HMMs for MAP estimation of a signal observed in noise, and 
considered the use of HMMs in implementation of a state-based Wiener filter 
sequence. 
WIENER FILTERS 



6.1 Wiener Filters: Least Square Error Estimation 

6.2 Block-Data Formulation of the Wiener Filter 

6.3 Interpretation of Wiener Filters as Projection in Vector Space 

6.4 Analysis of the Least Mean Square Error Signal 

6.5 Formulation of Wiener Filters in the Frequency Domain 

6.6 Some Applications of Wiener Filters 

6.7 The Choice of Wiener Filter Order 

6.8 Summary 

Wiener theory, formulated by Norbert Wiener, forms the 
foundation of data-dependent linear least square error filters. 
Wiener filters play a central role in a wide range of applications 
such as linear prediction, echo cancellation, signal restoration, channel 
equalisation and system identification. The coefficients of a Wiener filter 
are calculated to minimise the average squared distance between the filter 
output and a desired signal. In its basic form, the Wiener theory assumes 
that the signals are stationary processes. However, if the filter coefficients 
are periodically recalculated for every block of N signal samples then the 
filter adapts itself to the average characteristics of the signals within the 
blocks and becomes block-adaptive. A block-adaptive (or segment 
adaptive) filter can be used for signals such as speech and image that may 
be considered almost stationary over a relatively small block of samples. In 
this chapter, we study Wiener filter theory, and consider alternative 
methods of formulation of the Wiener filter problem. We consider the 
application of Wiener filters in channel equalisation, time-delay estimation 
and additive noise reduction. A case study of the frequency response of a 
Wiener filter, for additive noise reduction, provides useful insight into the 
operation of the filter. We also deal with some implementation issues of 
Wiener filters. 
6.1 Wiener Filters: Least Square Error Estimation 

Wiener formulated the continuous-time, least mean square error, estimation 
problem in his classic work on interpolation, extrapolation and smoothing 
of time series (Wiener 1949). The extension of the Wiener theory from 
continuous time to discrete time is simple, and of more practical use for 
implementation on digital signal processors. A Wiener filter can be an 
infinite-duration impulse response (IIR) filter or a finite-duration impulse 
response (FIR) filter. In general, the formulation of an IIR Wiener filter 
results in a set of non-linear equations, whereas the formulation of an FIR 
Wiener filter results in a set of linear equations and has a closed-form 
solution. In this chapter, we consider FIR Wiener filters, since they are 
relatively simple to compute, inherently stable and more practical. The main 
drawback of FIR filters compared with IIR filters is that they may need a 
large number of coefficients to approximate a desired response. 

Figure 6.1 illustrates a Wiener filter represented by the coefficient vector w. 
The filter takes as the input a signal y(m), and produces an output signal 
$\hat{x}(m)$, where $\hat{x}(m)$ is the least mean square error estimate of 
a desired or target signal x(m). The filter input-output relation is given by 

\[ \hat{x}(m) = \sum_{k=0}^{P-1} w_k\, y(m-k) = w^T y \tag{6.1} \]

where m is the discrete-time index, $y^T = [y(m), y(m-1), ..., y(m-P+1)]$ is 
the filter input signal, and the parameter vector 
$w^T = [w_0, w_1, ..., w_{P-1}]$ is the Wiener filter coefficient vector. In 
Equation (6.1), the filtering operation is expressed in two alternative and 
equivalent forms of a convolutional sum and an inner vector product. The 
Wiener filter error signal, e(m), is defined as the difference between the 
desired signal x(m) and the filter output signal $\hat{x}(m)$: 

\[ e(m) = x(m) - \hat{x}(m) = x(m) - w^T y \tag{6.2} \]

In Equation (6.2), for a given input signal y(m) and a desired signal x(m), 
the filter error e(m) depends on the filter coefficient vector w. 

Figure 6.1 Illustration of a Wiener filter structure. 



To explore the relation between the filter coefficient vector w and the 
error signal e(m) we expand Equation (6.2) for N samples of the signals 
x(m) and y(m): 

\[
\begin{pmatrix} e(0) \\ e(1) \\ e(2) \\ \vdots \\ e(N-1) \end{pmatrix}
=
\begin{pmatrix} x(0) \\ x(1) \\ x(2) \\ \vdots \\ x(N-1) \end{pmatrix}
-
\begin{pmatrix}
y(0) & y(-1) & y(-2) & \cdots & y(1-P) \\
y(1) & y(0) & y(-1) & \cdots & y(2-P) \\
y(2) & y(1) & y(0) & \cdots & y(3-P) \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
y(N-1) & y(N-2) & y(N-3) & \cdots & y(N-P)
\end{pmatrix}
\begin{pmatrix} w_0 \\ w_1 \\ w_2 \\ \vdots \\ w_{P-1} \end{pmatrix}
\tag{6.3}
\]

In a compact vector notation this matrix equation may be written as 

\[ e = x - Yw \tag{6.4} \]

where e is the error vector, x is the desired signal vector, Y is the input 
signal matrix and $Yw = \hat{x}$ is the Wiener filter output signal vector. 
It is assumed that the P-1 initial input signal samples 
[y(-1), ..., y(1-P)] are either known or set to zero. 
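
The structure of the input signal matrix is easy to see in code. The sketch below builds Y with the initial samples set to zero and evaluates the error vector of Equation (6.4); the function names `input_matrix` and `filter_error` are illustrative assumptions.

```python
import numpy as np

def input_matrix(y, P):
    """N x P matrix with Y[m, k] = y(m - k); samples before time 0 are set to zero."""
    y = np.asarray(y, dtype=float)
    N = len(y)
    y_padded = np.concatenate([np.zeros(P - 1), y])   # y_padded[P - 1 + n] = y(n)
    return np.stack([y_padded[P - 1 - k: P - 1 - k + N] for k in range(P)], axis=1)

def filter_error(x, y, w):
    """Return the error vector e = x - Yw and the filter output x_hat = Yw."""
    Y = input_matrix(y, len(w))
    x_hat = Y @ np.asarray(w, dtype=float)
    return np.asarray(x, dtype=float) - x_hat, x_hat
```

Each row of Y holds the P most recent input samples, so the product Yw reproduces the convolution sum of Equation (6.1) for m = 0, ..., N-1.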

In Equation (6.3), if the number of signal samples is equal to the 
number of filter coefficients, N=P, then we have a square matrix equation, 
and there is a unique filter solution w, with a zero estimation error e=0, 
such that $\hat{x} = Yw = x$. If N < P then the number of signal samples N is 
insufficient to obtain a unique solution for the filter coefficients; in 
this case there are an infinite number of solutions with zero estimation 
error, and the matrix equation is said to be underdetermined. In practice, 
the number of signal samples is much larger than the filter length, N>P; in 
this case, the matrix equation is said to be overdetermined and has a unique 
solution, usually with a non-zero error. When N>P, the filter coefficients 
are calculated to minimise an average error cost function, such as the 
average absolute value of error $E[|e(m)|]$, or the mean square error 
$E[e^2(m)]$, where $E[\cdot]$ is the expectation operator. The choice of the 
error function affects the optimality and the computational complexity of 
the solution. 

In Wiener theory, the objective criterion is the least mean square error 
(LSE) between the filter output and the desired signal. The least square 
error criterion is optimal for Gaussian distributed signals. As shown in the 
following, for FIR filters the LSE criterion leads to a linear and 
closed-form solution. The Wiener filter coefficients are obtained by 
minimising an average squared error function $E[e^2(m)]$ with respect to the 
filter coefficient vector w. From Equation (6.2), the mean square estimation 
error is given by 

\[
\begin{aligned}
E[e^2(m)] &= E[(x(m) - w^T y)^2] \\
&= E[x^2(m)] - 2 w^T E[y\, x(m)] + w^T E[y y^T] w \\
&= r_{xx}(0) - 2 w^T r_{yx} + w^T R_{yy} w
\end{aligned}
\tag{6.5}
\]

where $R_{yy} = E[y(m) y^T(m)]$ is the autocorrelation matrix of the input 
signal and $r_{yx} = E[x(m) y(m)]$ is the cross-correlation vector of the 
input and the desired signals. An expanded form of Equation (6.5) can be 
obtained as 

\[ E[e^2(m)] = r_{xx}(0) - 2 \sum_{k=0}^{P-1} w_k\, r_{yx}(k) + \sum_{k=0}^{P-1} w_k \sum_{j=0}^{P-1} w_j\, r_{yy}(k-j) \tag{6.6} \]

where $r_{yy}(k)$ and $r_{yx}(k)$ are the elements of the autocorrelation 
matrix $R_{yy}$ and the cross-correlation vector $r_{yx}$ respectively. From 
Equation (6.5), the mean square error for an FIR filter is a quadratic 
function of the filter coefficient vector w and has a single minimum point. 
For example, for a filter with only two coefficients $(w_0, w_1)$, the mean 
square error function is a bowl-shaped surface, with a single minimum point, 
as illustrated in Figure 6.2. The least mean square error point corresponds 
to the minimum error power. At this optimal operating point the mean square 
error surface has zero gradient. 

Figure 6.2 Mean square error surface for a two-tap FIR filter. 

From Equation (6.5), the gradient of the mean square error 
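function with respect to the filter coefficient vector is given by 

\[
\begin{aligned}
\frac{\partial}{\partial w} E[e^2(m)] &= -2 E[x(m) y(m)] + 2 w^T E[y(m) y^T(m)] \\
&= -2 r_{yx} + 2 w^T R_{yy}
\end{aligned}
\tag{6.7}
\]

where the gradient vector is defined as 

\[ \frac{\partial}{\partial w} = \left[ \frac{\partial}{\partial w_0}, \frac{\partial}{\partial w_1}, \frac{\partial}{\partial w_2}, ..., \frac{\partial}{\partial w_{P-1}} \right]^T \tag{6.8} \]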



The minimum mean square error Wiener filter is obtained by setting 
Equation (6.7) to zero: 

\[ R_{yy}\, w = r_{yx} \tag{6.9} \]

or, equivalently, 

\[ w = R_{yy}^{-1}\, r_{yx} \tag{6.10} \]

In an expanded form, the Wiener filter solution Equation (6.10) can be 
written as 

\[
\begin{pmatrix} w_0 \\ w_1 \\ w_2 \\ \vdots \\ w_{P-1} \end{pmatrix}
=
\begin{pmatrix}
r_{yy}(0) & r_{yy}(1) & r_{yy}(2) & \cdots & r_{yy}(P-1) \\
r_{yy}(1) & r_{yy}(0) & r_{yy}(1) & \cdots & r_{yy}(P-2) \\
r_{yy}(2) & r_{yy}(1) & r_{yy}(0) & \cdots & r_{yy}(P-3) \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
r_{yy}(P-1) & r_{yy}(P-2) & r_{yy}(P-3) & \cdots & r_{yy}(0)
\end{pmatrix}^{-1}
\begin{pmatrix} r_{yx}(0) \\ r_{yx}(1) \\ r_{yx}(2) \\ \vdots \\ r_{yx}(P-1) \end{pmatrix}
\tag{6.11}
\]


From Equation (6.11), the calculation of the Wiener filter coefficients 
requires the autocorrelation matrix of the input signal and the cross- 
correlation vector of the input and the desired signals. 

In statistical signal processing theory, the correlation values of a 
random process are obtained as the averages taken across the ensemble of 
different realisations of the process as described in Chapter 3. However in 
many practical situations there are only one or two finite-duration 
realisations of the signals x(m) and y(m). In such cases, assuming the signals 
are correlation-ergodic, we can use time averages instead of ensemble 
averages. For a signal record of length N samples, the time-averaged 
correlation values are computed as 

\[ r_{yy}(k) = \frac{1}{N} \sum_{m=0}^{N-1} y(m)\, y(m+k) \tag{6.12} \]

Note from Equation (6.11) that the autocorrelation matrix R yy has a highly 
regular Toeplitz structure. A Toeplitz matrix has constant elements along 
the left-right diagonals of the matrix. Furthermore, the correlation matrix is 
also symmetric about the main diagonal elements. There are a number of 
efficient methods for solving the linear matrix Equation (6.11), including 
the Cholesky decomposition, the singular value decomposition and the QR 
decomposition methods. 
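
As a rough sketch of how Equations (6.10) to (6.12) fit together, the following code estimates the correlation values by time averaging, forms the Toeplitz autocorrelation matrix and solves for the filter coefficients. The function name `wiener_fir` is hypothetical, the sums are truncated at the ends of the finite record, the cross-correlation lag is taken as $r_{yx}(k) = E[y(m-k)\,x(m)]$ (which follows from the definition of the vector y), and a general-purpose linear solver is used here rather than one that exploits the Toeplitz structure.

```python
import numpy as np

def wiener_fir(x, y, P):
    """Estimate a P-tap FIR Wiener filter from one record of x(m) and y(m)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    N = len(y)
    # Time-averaged correlations; sums are truncated at the record boundaries.
    r_yy = np.array([np.dot(y[:N - k], y[k:]) / N for k in range(P)])
    r_yx = np.array([np.dot(y[:N - k], x[k:]) / N for k in range(P)])
    # Symmetric Toeplitz autocorrelation matrix: R_yy[k, j] = r_yy(|k - j|).
    lags = np.abs(np.arange(P)[:, None] - np.arange(P)[None, :])
    R_yy = r_yy[lags]
    return np.linalg.solve(R_yy, r_yx)
```

For long records and a well-conditioned $R_{yy}$ the result should be close to the filter that generated the desired signal, when such a filter exists.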
6.2 Block-Data Formulation of the Wiener Filter 

In this section we consider an alternative formulation of a Wiener filter for a 
block of N samples of the input signal [y(0), y(1), ..., y(N-1)] and the 
desired signal [x(0), x(1), ..., x(N-1)]. The set of N linear equations 
describing the Wiener filter input/output relation can be written in matrix 
form as 

\[
\begin{pmatrix} \hat{x}(0) \\ \hat{x}(1) \\ \hat{x}(2) \\ \vdots \\ \hat{x}(N-2) \\ \hat{x}(N-1) \end{pmatrix}
=
\begin{pmatrix}
y(0) & y(-1) & y(-2) & \cdots & y(2-P) & y(1-P) \\
y(1) & y(0) & y(-1) & \cdots & y(3-P) & y(2-P) \\
y(2) & y(1) & y(0) & \cdots & y(4-P) & y(3-P) \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
y(N-2) & y(N-3) & y(N-4) & \cdots & y(N-P) & y(N-1-P) \\
y(N-1) & y(N-2) & y(N-3) & \cdots & y(N+1-P) & y(N-P)
\end{pmatrix}
\begin{pmatrix} w_0 \\ w_1 \\ w_2 \\ \vdots \\ w_{P-2} \\ w_{P-1} \end{pmatrix}
\tag{6.13}
\]


Equation (6.13) can be rewritten in compact matrix notation as 

\[ \hat{x} = Yw \tag{6.14} \]

The Wiener filter error is the difference between the desired signal and the 
filter output defined as 

\[ e = x - \hat{x} = x - Yw \tag{6.15} \]



The energy of the error vector, that is the sum of the squared elements of 
the error vector, is given by the inner vector product as 

\[
\begin{aligned}
e^T e &= (x - Yw)^T (x - Yw) \\
&= x^T x - x^T Y w - w^T Y^T x + w^T Y^T Y w
\end{aligned}
\tag{6.16}
\]



The gradient of the squared error function with respect to the Wiener filter 
coefficients is obtained by differentiating Equation (6.16): 

\[ \frac{\partial\, e^T e}{\partial w} = -2 x^T Y + 2 w^T Y^T Y \tag{6.17} \]

The Wiener filter coefficients are obtained by setting the gradient of the 
squared error function of Equation (6.17) to zero; this yields 

\[ (Y^T Y)\, w = Y^T x \tag{6.18} \]

or 

\[ w = (Y^T Y)^{-1}\, Y^T x \tag{6.19} \]

Note that the matrix $Y^T Y$ is a time-averaged estimate of the 
autocorrelation matrix of the filter input signal $R_{yy}$, and that the 
vector $Y^T x$ is a time-averaged estimate of $r_{yx}$, the 
cross-correlation vector of the input and the 
desired signals. Theoretically, the Wiener filter is obtained from 
minimisation of the squared error across the ensemble of different 
realisations of a process as described in the previous section. For a 
correlation-ergodic process, as the signal length N approaches infinity the 
block-data Wiener filter of Equation (6.19) approaches the Wiener filter of 
Equation (6.10): 

\[ \lim_{N \to \infty} \left[ w = (Y^T Y)^{-1}\, Y^T x \right] = R_{yy}^{-1}\, r_{yx} \tag{6.20} \]



Since the least square error method described in this section requires a 
block of N samples of the input and the desired signals, it is also referred to 
as the block least square (BLS) error estimation method. The block 
estimation method is appropriate for processing of signals that can be 
considered as time-invariant over the duration of the block. 
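
A minimal sketch of the block least square error solution is given below. The function name `block_wiener` is an assumption, the initial input samples are set to zero, and the normal equations (6.18) are solved directly for clarity, although this is not the most numerically robust route.

```python
import numpy as np

def block_wiener(x, y, P):
    """Block least square error filter for one block of N samples."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    N = len(y)
    y_padded = np.concatenate([np.zeros(P - 1), y])   # zero initial samples
    # Y[m, k] = y(m - k), as in the input signal matrix of the text
    Y = np.stack([y_padded[P - 1 - k: P - 1 - k + N] for k in range(P)], axis=1)
    w = np.linalg.solve(Y.T @ Y, Y.T @ x)             # (Y^T Y) w = Y^T x
    return w, Y
```

At the solution the gradient $-2x^TY + 2w^TY^TY$ vanishes, which is equivalent to saying that the error vector is orthogonal to every column of Y.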



6.2.1 QR Decomposition of the Least Square Error Equation 

An efficient and robust method for solving the least square error Equation 
(6.19) is the QR decomposition (QRD) method. In this method, the N × P 
signal matrix Y is decomposed in terms of an N × N orthonormal matrix Q and 
a P × P upper-triangular matrix R as 

\[ QY = \begin{pmatrix} R \\ 0 \end{pmatrix} \tag{6.21} \]

where 0 is the (N - P) × P null matrix, $Q^T Q = Q Q^T = I$, and the 
upper-triangular matrix R is of the form 

\[
R = \begin{pmatrix}
r_{00} & r_{01} & r_{02} & \cdots & r_{0,P-1} \\
0 & r_{11} & r_{12} & \cdots & r_{1,P-1} \\
0 & 0 & r_{22} & \cdots & r_{2,P-1} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & r_{P-1,P-1}
\end{pmatrix}
\tag{6.22}
\]

Substitution of Equation (6.21) in Equation (6.18) yields 

\[ \begin{pmatrix} R \\ 0 \end{pmatrix}^T Q Q^T \begin{pmatrix} R \\ 0 \end{pmatrix} w = \begin{pmatrix} R \\ 0 \end{pmatrix}^T Q x \tag{6.23} \]

From Equation (6.23) we have 

\[ \begin{pmatrix} R \\ 0 \end{pmatrix} w = Q x \tag{6.24} \]

From Equation (6.24) we have 

\[ R w = x_Q \tag{6.25} \]

where the vector $x_Q$ on the right hand side of Equation (6.25) is composed 
of the first P elements of the product Qx. Since the matrix R is 
upper-triangular, the coefficients of the least square error filter can be 
obtained easily through a process of back substitution from Equation (6.25), 
starting with the coefficient $w_{P-1} = x_Q(P-1) / r_{P-1,P-1}$. 

The main computational steps in the QR decomposition are the 
determination of the orthonormal matrix Q and of the upper triangular 
matrix R. The decomposition of a matrix into QR matrices can be achieved 
using a number of methods, including the Gram-Schmidt orthogonalisation 
method, the Householder method and the Givens rotation method. 
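
The QR route to the least square error filter can be sketched with NumPy as follows. Note that `numpy.linalg.qr` returns, by default, the reduced factorisation $Y = Q_1 R$ with $Q_1$ of size N × P, so that $Q_1^T$ corresponds to the first P rows of the matrix Q used in the text. The function name `qr_least_squares` is an assumption.

```python
import numpy as np

def qr_least_squares(Y, x):
    """Solve the least square error problem min ||x - Yw|| by QR and back substitution."""
    Q1, R = np.linalg.qr(Y)        # reduced QR: Y = Q1 R, R is P x P upper-triangular
    x_q = Q1.T @ x                 # first P elements of Qx in the notation of the text
    P = R.shape[0]
    w = np.zeros(P)
    for i in range(P - 1, -1, -1):                       # back substitution on R w = x_q
        w[i] = (x_q[i] - np.dot(R[i, i + 1:], w[i + 1:])) / R[i, i]
    return w
```

The back substitution starts from the last coefficient, which depends only on the last diagonal element of R, and works upwards one row at a time.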
Figure 6.3 The least square error projection of a desired signal vector x onto a 
plane containing the input signal vectors $y_1$ and $y_2$ is the 
perpendicular projection of x shown as the shaded vector. 



6.3 Interpretation of Wiener Filters as Projection in Vector Space 

In this section, we consider an alternative formulation of Wiener filters 
where the least square error estimate is visualized as the perpendicular 
minimum distance projection of the desired signal vector onto the vector 
space of the input signal. A vector space is the collection of an infinite 
number of vectors that can be obtained from linear combinations of a 
number of independent vectors. 

In order to develop a vector space interpretation of the least square 
error estimation problem, we rewrite the matrix Equation (6.13) and express 
the filter output vector $\hat{x}$ as a linear weighted combination of the 
column vectors of the input signal matrix as 

\[
\begin{pmatrix} \hat{x}(0) \\ \hat{x}(1) \\ \hat{x}(2) \\ \vdots \\ \hat{x}(N-1) \end{pmatrix}
= w_0 \begin{pmatrix} y(0) \\ y(1) \\ y(2) \\ \vdots \\ y(N-1) \end{pmatrix}
+ w_1 \begin{pmatrix} y(-1) \\ y(0) \\ y(1) \\ \vdots \\ y(N-2) \end{pmatrix}
+ \cdots
+ w_{P-1} \begin{pmatrix} y(1-P) \\ y(2-P) \\ y(3-P) \\ \vdots \\ y(N-P) \end{pmatrix}
\tag{6.26}
\]

In compact notation, Equation (6.26) may be written as 

\[ \hat{x} = w_0 y_0 + w_1 y_1 + \cdots + w_{P-1} y_{P-1} \tag{6.27} \]

In Equation (6.27) the signal estimate $\hat{x}$ is a linear combination of 
P basis vectors $[y_0, y_1, ..., y_{P-1}]$, and hence it can be said that 
the estimate $\hat{x}$ is in the vector subspace formed by the input signal 
vectors $[y_0, y_1, ..., y_{P-1}]$. 

In general, the P N-dimensional input signal vectors 
$[y_0, y_1, ..., y_{P-1}]$ in Equation (6.27) define the basis vectors for a 
subspace in an N-dimensional signal space. If P, the number of basis 
vectors, is equal to N, the vector dimension, then the subspace defined by 
the input signal vectors encompasses the entire N-dimensional signal space 
and includes the desired signal vector x. In this case, the signal estimate 
$\hat{x} = x$ and the estimation error is zero. However, in practice, N>P, 
and the signal space defined by the P input signal vectors of Equation 
(6.27) is only a subspace of the N-dimensional signal space. In this case, 
the estimation error is zero only if the desired signal x happens to be in 
the subspace of the input signal, otherwise the best estimate of x is the 
perpendicular projection of the vector x onto the vector space of the input 
signal $[y_0, y_1, ..., y_{P-1}]$, as explained in the following example. 

Example 6.1 Figure 6.3 illustrates a vector space interpretation of a 
simple least square error estimation problem, where 
$y^T = [y(2), y(1), y(0), y(-1)]$ is the input signal, 
$x^T = [x(2), x(1), x(0)]$ is the desired signal and $w^T = [w_0, w_1]$ is 
the filter coefficient vector. As in Equation (6.26), the filter output can 
be written as 

\[
\begin{pmatrix} \hat{x}(2) \\ \hat{x}(1) \\ \hat{x}(0) \end{pmatrix}
= w_0 \begin{pmatrix} y(2) \\ y(1) \\ y(0) \end{pmatrix}
+ w_1 \begin{pmatrix} y(1) \\ y(0) \\ y(-1) \end{pmatrix}
\tag{6.28}
\]

In Equation (6.28), the input signal vectors $y_1^T = [y(2), y(1), y(0)]$ 
and $y_2^T = [y(1), y(0), y(-1)]$ are 3-dimensional vectors. The subspace 
defined by the linear combinations of the two input vectors $[y_1, y_2]$ is 
a 2-dimensional plane in a 3-dimensional signal space. The filter output is 
a linear combination of $y_1$ and $y_2$, and hence it is confined to the 
plane containing these two vectors. The least square error estimate of x is 
the orthogonal projection of x on the plane of $[y_1, y_2]$ as shown by the 
shaded vector $\hat{x}$. If the desired vector happens to be in the plane 
defined by the vectors $y_1$ and $y_2$ then the estimation error will be 
zero, otherwise the estimation error will be the perpendicular distance of x 
from the plane containing $y_1$ and $y_2$. 
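
A small numerical instance of this example may help. The sample values below are arbitrary choices made for illustration only; the point of the sketch is that the least square error vector comes out orthogonal to both input vectors, so that the energy of x splits into the energy of the projection and the energy of the error.

```python
import numpy as np

# Arbitrary illustrative samples: y(2) = 1, y(1) = 2, y(0) = 1, y(-1) = 0
y1 = np.array([1.0, 2.0, 1.0])     # [y(2), y(1), y(0)]
y2 = np.array([2.0, 1.0, 0.0])     # [y(1), y(0), y(-1)]
x = np.array([1.0, 0.0, 2.0])      # desired signal [x(2), x(1), x(0)]

Y = np.column_stack([y1, y2])                  # basis vectors of the input plane
w, *_ = np.linalg.lstsq(Y, x, rcond=None)      # least square error coefficients
x_hat = Y @ w                                  # projection of x onto the plane
e = x - x_hat                                  # error vector, perpendicular to the plane

print("w =", w)
print("e . y1 =", e @ y1, " e . y2 =", e @ y2)
```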



6.4 Analysis of the Least Mean Square Error Signal 

The optimality criterion in the formulation of the Wiener filter is the least 
mean square distance between the filter output and the desired signal. In 
this section, the variance of the filter error signal is analysed. Substituting 
the Wiener equation R yy w=r yx in Equation (6.5) gives the least mean square 
error: 



< E[e 2 (m)] = r xx (0) - w T r yx 

/fy. T n (6.29) 

= r xx (())-w R yy w 

Now, for zero-mean signals, it is easy to show that in Equation (6.29) the 
term w T R yy w is the variance of the Wiener filter output x(m ) : 

< 7 ? = “E[x 2 (m)] = w T R yy w (6.30) 



Therefore Equation (6.29) may be written as 

$$\sigma_e^2 = \sigma_x^2 - \sigma_{\hat{x}}^2 \qquad (6.31)$$

where $\sigma_x^2 = \mathcal{E}[x^2(m)]$, $\sigma_{\hat{x}}^2 = \mathcal{E}[\hat{x}^2(m)]$ and $\sigma_e^2 = \mathcal{E}[e^2(m)]$ are the variances 
of the desired signal, the filter estimate of the desired signal and the error 
signal respectively. In general, the filter input $y(m)$ is composed of a signal 
component $x_c(m)$ and a random noise $n(m)$: 

$$y(m) = x_c(m) + n(m) \qquad (6.32)$$

where the signal $x_c(m)$ is the part of the observation that is correlated with 
the desired signal $x(m)$, and it is this part of the input signal that may be 
transformable through a Wiener filter to the desired signal. Using Equation 
(6.32) the Wiener filter error may be decomposed into two distinct 
components: 

$$e(m) = x(m) - \sum_{k=0}^{P-1} w_k\, y(m-k) = \left[ x(m) - \sum_{k=0}^{P-1} w_k\, x_c(m-k) \right] - \sum_{k=0}^{P-1} w_k\, n(m-k) \qquad (6.33)$$

or 

$$e(m) = e_x(m) + e_n(m) \qquad (6.34)$$

where $e_x(m)$ is the difference between the desired signal $x(m)$ and the output 
of the filter in response to the input signal component $x_c(m)$, i.e. 

$$e_x(m) = x(m) - \sum_{k=0}^{P-1} w_k\, x_c(m-k) \qquad (6.35)$$

and $e_n(m)$ is the error in the output due to the presence of noise $n(m)$ in the 
input signal: 

$$e_n(m) = -\sum_{k=0}^{P-1} w_k\, n(m-k) \qquad (6.36)$$

The variance of filter error can be rewritten as 

$$\sigma_e^2 = \sigma_{e_x}^2 + \sigma_{e_n}^2 \qquad (6.37)$$

Note that in Equation (6.34), $e_x(m)$ is that part of the signal that cannot be 
recovered by the Wiener filter, and represents distortion in the signal 
output, and $e_n(m)$ is that part of the noise that cannot be blocked by the 
Wiener filter. Ideally, $e_x(m) = 0$ and $e_n(m) = 0$, but this ideal situation is 
possible only if the following conditions are satisfied: 

(a) The spectra of the signal and the noise are separable by a linear 
filter. 

(b) The signal component of the input, that is $x_c(m)$, is linearly 
transformable to $x(m)$. 

(c) The filter length P is sufficiently large. 

The issue of signal and noise separability is addressed in Section 6.6. 
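
The error analysis above can be illustrated with a block least square error filter, for which the time-averaged versions of Equations (6.31) and (6.34) hold exactly. This is a sketch under stated assumptions: the signals are synthetic, expectations are replaced by time averages, $x_c(m) = x(m)$, and the helper `delay_matrix` is a hypothetical name introduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 2000, 4
x = np.convolve(rng.standard_normal(N), [1.0, 0.6, 0.2])[:N]  # desired signal
n = 0.5 * rng.standard_normal(N)                              # additive noise
y = x + n                                                     # here x_c(m) = x(m)

def delay_matrix(s, P):
    """Rows are [s(m), s(m-1), ..., s(m-P+1)], with zeros before m = 0."""
    padded = np.concatenate([np.zeros(P - 1), s])
    return np.column_stack(
        [padded[P - 1 - k : P - 1 - k + len(s)] for k in range(P)]
    )

Y, Xc, Nn = delay_matrix(y, P), delay_matrix(x, P), delay_matrix(n, P)
w = np.linalg.solve(Y.T @ Y, Y.T @ x)   # block least square error filter

e = x - Y @ w            # total error
e_x = x - Xc @ w         # signal distortion, as in Eq. (6.35)
e_n = -(Nn @ w)          # residual noise, as in Eq. (6.36)

var_x = np.mean(x**2)
var_xhat = np.mean((Y @ w) ** 2)
var_e = np.mean(e**2)
print(var_e, var_x - var_xhat)          # the two values agree, as in Eq. (6.31)
```

Equation (6.37) additionally requires $e_x(m)$ and $e_n(m)$ to be uncorrelated, so with a finite record `np.mean(e_x**2) + np.mean(e_n**2)` matches `var_e` only approximately.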



6.5 Formulation of Wiener Filters in the Frequency Domain 

In the frequency domain, the Wiener filter output $\hat{X}(f)$ is the product of the 
input signal $Y(f)$ and the filter frequency response $W(f)$: 

$$\hat{X}(f) = W(f)\, Y(f) \qquad (6.38)$$



The estimation error signal $E(f)$ is defined as the difference between the 
desired signal $X(f)$ and the filter output $\hat{X}(f)$, 

$$E(f) = X(f) - \hat{X}(f) = X(f) - W(f)\, Y(f) \qquad (6.39)$$

and the mean square error at a frequency $f$ is given by 

$$\mathcal{E}\left[ |E(f)|^2 \right] = \mathcal{E}\left[ \big( X(f) - W(f) Y(f) \big)^{*} \big( X(f) - W(f) Y(f) \big) \right] \qquad (6.40)$$



where $\mathcal{E}[\cdot]$ is the expectation function, and the symbol * denotes the 
complex conjugate. Note from Parseval's theorem that the mean square 
errors in the time and frequency domains are related by 

$$\sum_{m=0}^{N-1} e^2(m) = \int_{-1/2}^{1/2} |E(f)|^2 \, df \qquad (6.41)$$



To obtain the least mean square error filter we set the complex derivative of 
Equation (6.40) with respect to filter $W(f)$ to zero 

$$\frac{\partial \mathcal{E}\left[ |E(f)|^2 \right]}{\partial W(f)} = 2 W(f) P_{YY}(f) - 2 P_{XY}(f) = 0 \qquad (6.42)$$



where $P_{YY}(f) = \mathcal{E}[Y(f) Y^{*}(f)]$ and $P_{XY}(f) = \mathcal{E}[X(f) Y^{*}(f)]$ are the power spectrum 
of $Y(f)$, and the cross-power spectrum of $Y(f)$ and $X(f)$ respectively. From 
Equation (6.42), the least mean square error Wiener filter in the frequency 
domain is given as 

$$W(f) = \frac{P_{XY}(f)}{P_{YY}(f)} \qquad (6.43)$$



Alternatively, the frequency-domain Wiener filter Equation (6.43) can be 
obtained from the Fourier transform of the time-domain Wiener Equation 
(6.9): 

$$\sum_{m} \sum_{k=0}^{P-1} w_k\, r_{yy}(m-k)\, e^{-j\omega m} = \sum_{m} r_{yx}(m)\, e^{-j\omega m} \qquad (6.44)$$



From the Wiener-Khinchine relation, the correlation and power-spectral 
functions are Fourier transform pairs. Using this relation, and the Fourier 
transform property that convolution in time is equivalent to multiplication 
in frequency, it is easy to show that the Wiener filter is given by Equation 
(6.43). 
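
As an illustration of Equation (6.43), the spectra can be estimated by averaging over frames and the filter formed bin by bin. This is a minimal sketch, assuming NumPy, synthetic white signals and simple frame averaging as the spectral estimator; none of these choices come from the text.

```python
import numpy as np

rng = np.random.default_rng(1)
frames, nfft = 200, 64
x = rng.standard_normal((frames, nfft))             # hypothetical desired frames
y = x + 0.7 * rng.standard_normal((frames, nfft))   # noisy observation frames

X = np.fft.rfft(x, axis=1)
Y = np.fft.rfft(y, axis=1)

P_yy = np.mean(Y * np.conj(Y), axis=0).real   # estimate of E[Y(f) Y*(f)]
P_xy = np.mean(X * np.conj(Y), axis=0)        # estimate of E[X(f) Y*(f)]
W = P_xy / P_yy                                # Eq. (6.43), one value per bin

E = X - W * Y                                  # error spectrum, Eq. (6.39)
ortho = np.mean(E * np.conj(Y), axis=0)        # cross-spectrum of error and input
print(np.max(np.abs(ortho)))                   # zero to rounding error
```

The vanishing cross-spectrum between the error and the input is the frequency-domain form of the orthogonality condition that defines the least square error solution.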

6.6 Some Applications of Wiener Filters 

In this section, we consider some applications of the Wiener filter in 
reducing broadband additive noise, in time-alignment of signals in 
multichannel or multisensor systems, and in channel equalisation. 

Figure 6.4 Variation of the gain of Wiener filter frequency response with SNR. 

6.6.1 Wiener Filter for Additive Noise Reduction 

Consider a signal $x(m)$ observed in broadband additive noise $n(m)$, and 
modelled as 

$$y(m) = x(m) + n(m) \qquad (6.45)$$

Assuming that the signal and the noise are uncorrelated, it follows that the 
autocorrelation matrix of the noisy signal is the sum of the autocorrelation 
matrices of the signal $x(m)$ and the noise $n(m)$: 

$$\mathbf{R}_{yy} = \mathbf{R}_{xx} + \mathbf{R}_{nn} \qquad (6.46)$$

and we can also write 

$$\mathbf{r}_{xy} = \mathbf{r}_{xx} \qquad (6.47)$$

where $\mathbf{R}_{yy}$, $\mathbf{R}_{xx}$ and $\mathbf{R}_{nn}$ are the autocorrelation matrices of the noisy signal, 
the noise-free signal and the noise respectively, and $\mathbf{r}_{xy}$ is the 
cross-correlation vector of the noisy signal and the noise-free signal. Substitution 
of Equations (6.46) and (6.47) in the Wiener filter, Equation (6.10), yields 

$$\mathbf{w} = (\mathbf{R}_{xx} + \mathbf{R}_{nn})^{-1}\, \mathbf{r}_{xx} \qquad (6.48)$$

Figure 6.5 Illustration of the variation of Wiener frequency response with signal 
spectrum for additive white noise. The Wiener filter response broadly follows the 
signal spectrum. 

Equation (6.48) is the optimal linear filter for the removal of additive noise. 
In the following, a study of the frequency response of the Wiener filter 
provides useful insight into its operation. In the 
frequency domain, the noisy signal $Y(f)$ is given by 

$$Y(f) = X(f) + N(f) \qquad (6.49)$$



where $X(f)$ and $N(f)$ are the signal and noise spectra. For a signal observed 
in additive random noise, the frequency-domain Wiener filter is obtained as 

$$W(f) = \frac{P_{XX}(f)}{P_{XX}(f) + P_{NN}(f)} \qquad (6.50)$$

where $P_{XX}(f)$ and $P_{NN}(f)$ are the signal and noise power spectra. Dividing 
the numerator and the denominator of Equation (6.50) by the noise power 
spectrum $P_{NN}(f)$ and substituting the variable $SNR(f) = P_{XX}(f)/P_{NN}(f)$ yields 

$$W(f) = \frac{SNR(f)}{SNR(f) + 1} \qquad (6.51)$$

where SNR is a signal-to-noise ratio measure. Note that the variable $SNR(f)$ 
is expressed in terms of the power-spectral ratio, and not in the more usual 
terms of log power ratio. Therefore $SNR(f) = 0$ corresponds to $-\infty$ dB. 

From Equation (6.51), the following interpretation of the Wiener filter 
frequency response $W(f)$ in terms of the signal-to-noise ratio can be 
deduced. For additive noise, the Wiener filter frequency response is a real 
positive number in the range $0 \le W(f) \le 1$. Now consider the two limiting 
cases of (a) a noise-free signal, $SNR(f) = \infty$, and (b) an extremely noisy 
signal, $SNR(f) = 0$. At very high SNR, $W(f) \approx 1$, and the filter applies little or 
no attenuation to the noise-free frequency component. At the other extreme, 
when $SNR(f) = 0$, $W(f) = 0$. Therefore, for additive noise, the Wiener filter 
attenuates each frequency component in proportion to an estimate of the 
signal-to-noise ratio. Figure 6.4 shows the variation of the Wiener filter 
response $W(f)$ with the signal-to-noise ratio $SNR(f)$. 

Figure 6.6 Illustration of separability: (a) The signal and noise spectra do not 
overlap, and the signal can be recovered by a low-pass filter; (b) the signal and 
noise spectra overlap, and the noise can be reduced but not completely removed. 

An alternative illustration of the variations of the Wiener filter 
frequency response with SNR(f) is shown in Figure 6.5. It illustrates the 
similarity between the Wiener filter frequency response and the signal 
spectrum for the case of an additive white noise disturbance. Note that at a 
spectral peak of the signal spectrum, where the SNR(f) is relatively high, the 
Wiener filter frequency response is also high, and the filter applies little 
attenuation. At a signal trough, the signal-to-noise ratio is low, and so is the 
Wiener filter response. Hence, for additive white noise, the Wiener filter 
response broadly follows the signal spectrum. 
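
The dependence of the gain on SNR in Equation (6.51) is easy to tabulate. The sketch below assumes NumPy; the function name `wiener_gain` and the chosen SNR values are illustrative only. Note that the formula takes a linear power ratio, so SNR values quoted in dB are converted first.

```python
import numpy as np

def wiener_gain(snr):
    """Wiener gain W = SNR / (SNR + 1) for a linear (not dB) power ratio."""
    snr = np.asarray(snr, dtype=float)
    return snr / (snr + 1.0)

snr_db = np.array([-20.0, 0.0, 20.0])   # hypothetical SNR values in dB
snr = 10.0 ** (snr_db / 10.0)           # convert to linear power ratio
gain = wiener_gain(snr)

for d, g in zip(snr_db, gain):
    print(f"SNR = {d:6.1f} dB  ->  W = {g:.4f}")
```

At 0 dB the signal and noise powers are equal and the gain is one half; the gain approaches 1 at high SNR and 0 at low SNR, which is the behaviour plotted in Figure 6.4.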

6.6.2 Wiener Filter and the Separability of Signal and Noise 

A signal is completely recoverable from noise if the spectra of the signal 
and the noise do not overlap. An example of a noisy signal with separable 
signal and noise spectra is shown in Figure 6.6(a). In this case, the signal 
and the noise occupy different parts of the frequency spectrum, and can be 
separated with a low-pass, or a high-pass, filter. Figure 6.6(b) illustrates a 
more common example of a signal and noise process with overlapping 
spectra. For this case, it is not possible to completely separate the signal 
from the noise. However, the effects of the noise can be reduced by using a 
Wiener filter that attenuates each noisy signal frequency in proportion to an 
estimate of the signal-to-noise ratio as described by Equation (6.51). 



6.6.3 The Square-Root Wiener Filter 

In the frequency domain, the Wiener filter output $\hat{X}(f)$ is the product of the 
input $Y(f)$ and the filter response $W(f)$ as expressed in Equation 
(6.38). Taking the expectation of the squared magnitude of both sides of 
Equation (6.38) yields the power spectrum of the filtered signal as 

$$\mathcal{E}\left[ |\hat{X}(f)|^2 \right] = |W(f)|^2\, \mathcal{E}\left[ |Y(f)|^2 \right] = |W(f)|^2\, P_{YY}(f) \qquad (6.52)$$



Substitution of $W(f)$ from Equation (6.43) in Equation (6.52) yields 

$$\mathcal{E}\left[ |\hat{X}(f)|^2 \right] = \frac{|P_{XY}(f)|^2}{P_{YY}(f)} \qquad (6.53)$$

Now, for a signal observed in an uncorrelated additive noise we have 

$$P_{YY}(f) = P_{XX}(f) + P_{NN}(f) \qquad (6.54)$$

and 

$$P_{XY}(f) = P_{XX}(f) \qquad (6.55)$$

Substitution of Equations (6.54) and (6.55) in Equation (6.53) yields 

$$\mathcal{E}\left[ |\hat{X}(f)|^2 \right] = \frac{P_{XX}^2(f)}{P_{XX}(f) + P_{NN}(f)} \qquad (6.56)$$



Now, in Equation (6.38) if instead of the Wiener filter, the square root of 
the Wiener filter magnitude frequency response is used, the result is 

$$\hat{X}(f) = |W(f)|^{1/2}\, Y(f) \qquad (6.57)$$

and the power spectrum of the signal, filtered by the square-root Wiener 
filter, is given by 

$$\mathcal{E}\left[ |\hat{X}(f)|^2 \right] = \left[ |W(f)|^{1/2} \right]^2 \mathcal{E}\left[ |Y(f)|^2 \right] = \frac{P_{XY}(f)}{P_{YY}(f)}\, P_{YY}(f) = P_{XY}(f) \qquad (6.58)$$

Now, for uncorrelated signal and noise Equation (6.58) becomes 

$$\mathcal{E}\left[ |\hat{X}(f)|^2 \right] = P_{XX}(f) \qquad (6.59)$$

Thus, for additive noise the power spectrum of the output of the square-root 
Wiener filter is the same as the power spectrum of the desired signal. 
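
A short numerical comparison of Equations (6.56) and (6.59) may help. The sketch assumes NumPy and uses made-up power spectrum values on four frequency bins; it shows that the Wiener filter output has less power than the desired signal, whereas the square-root Wiener filter restores the signal power spectrum.

```python
import numpy as np

P_xx = np.array([4.0, 2.0, 1.0, 0.25])   # hypothetical signal power spectrum
P_nn = np.full(4, 0.5)                    # hypothetical white noise power spectrum
P_yy = P_xx + P_nn                        # Eq. (6.54)

W = P_xx / P_yy                           # Wiener filter, Eq. (6.50)
out_wiener = W**2 * P_yy                  # output power, Eq. (6.56)
out_sqrt = np.sqrt(W) ** 2 * P_yy         # output power, Eqs. (6.58)-(6.59)

print("Wiener output power     :", out_wiener)
print("square-root output power:", out_sqrt)   # equal to P_xx
```

The shortfall of `out_wiener` relative to `P_xx` is largest in the bins where the SNR is lowest.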



6.6.4 Wiener Channel Equaliser 

Communication channel distortions may be modelled by a combination of a 
linear filter and an additive random noise source as shown in Figure 6.7. 
The input/output signals of a linear time invariant channel can be modelled 
as 

$$y(m) = \sum_{k=0}^{P-1} h_k\, x(m-k) + n(m) \qquad (6.60)$$

where $x(m)$ and $y(m)$ are the transmitted and received signals, $[h_k]$ is the 
impulse response of a linear filter model of the channel, and $n(m)$ models 
the channel noise. In the frequency domain Equation (6.60) becomes 

$$Y(f) = X(f) H(f) + N(f) \qquad (6.61)$$

where $X(f)$, $Y(f)$, $H(f)$ and $N(f)$ are the signal, noisy signal, channel and noise 
spectra respectively. To remove the channel distortions, the receiver is 
followed by an equaliser. The equaliser input is the distorted channel 
output, and the desired signal is the channel input. Using Equation (6.43) it 
is easy to show that the Wiener equaliser in the frequency domain is given 
by 

$$W(f) = \frac{P_{XX}(f)\, H^{*}(f)}{P_{XX}(f)\, |H(f)|^2 + P_{NN}(f)} \qquad (6.62)$$

where it is assumed that the channel noise and the signal are uncorrelated. 
In the absence of channel noise, $P_{NN}(f) = 0$, and the Wiener filter is simply 
the inverse of the channel filter model, $W(f) = H^{-1}(f)$. The equalisation 
problem is treated in detail in Chapter 15. 

Figure 6.7 Illustration of a channel model followed by an equaliser. 
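
Equation (6.62) can be evaluated directly on a frequency grid. The following is a sketch only, assuming NumPy, a made-up two-tap channel and flat signal and noise spectra; the function name `wiener_equaliser` is hypothetical.

```python
import numpy as np

def wiener_equaliser(H, P_xx, P_nn):
    """Frequency-domain Wiener equaliser of Eq. (6.62)."""
    return P_xx * np.conj(H) / (P_xx * np.abs(H) ** 2 + P_nn)

h = np.array([1.0, 0.5])          # hypothetical channel impulse response
H = np.fft.fft(h, 16)             # channel frequency response on 16 bins
P_xx = np.ones(16)                # flat signal power spectrum (assumed)

W_clean = wiener_equaliser(H, P_xx, np.zeros(16))      # no channel noise
W_noisy = wiener_equaliser(H, P_xx, np.full(16, 0.1))  # with channel noise

print(np.max(np.abs(W_clean * H - 1.0)))   # zero to rounding: W = 1/H
print(np.abs(W_noisy * H))                 # below 1: inversion is tempered
```

With noise present the combined response `W_noisy * H` falls furthest below unity at the frequencies where the channel gain is smallest, which avoids the noise amplification of a plain inverse filter.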



6.6.5 Time-Alignment of Signals in Multichannel/Multisensor 
Systems 

Figure 6.8 Illustration of a multichannel system where Wiener filters are used to 
time-align the signals from different channels. 

In multichannel/multisensor signal processing there are a number of noisy 
and distorted versions of a signal $x(m)$, and the objective is to use all the 
observations in estimating $x(m)$, as illustrated in Figure 6.8, where the phase 
and frequency characteristics of each channel are modelled by a linear filter. 
As a simple example, consider the problem of time-alignment of two noisy 
records of a signal given as 

$$y_1(m) = x(m) + n_1(m) \qquad (6.63)$$

$$y_2(m) = A\, x(m - D) + n_2(m) \qquad (6.64)$$

where $y_1(m)$ and $y_2(m)$ are the noisy observations from channels 1 and 2, 
$n_1(m)$ and $n_2(m)$ are uncorrelated noise in each channel, $D$ is the time delay 
of arrival of the two signals, and $A$ is an amplitude scaling factor. Now 
assume that $y_1(m)$ is used as the input to a Wiener filter and that, in the 
absence of the signal $x(m)$, $y_2(m)$ is used as the "desired" signal. The error 
signal is given by 

$$e(m) = y_2(m) - \sum_{k=0}^{P-1} w_k\, y_1(m-k) = \left[ A\, x(m-D) - \sum_{k=0}^{P-1} w_k\, x(m-k) \right] - \left[ \sum_{k=0}^{P-1} w_k\, n_1(m-k) \right] + n_2(m) \qquad (6.65)$$



The Wiener filter strives to minimise the terms shown inside the square 
brackets in Equation (6.65). Using the Wiener filter Equation (6.10), we 
have 

$$\mathbf{w} = \mathbf{R}_{y_1 y_1}^{-1}\, \mathbf{r}_{y_1 y_2} = \left( \mathbf{R}_{xx} + \mathbf{R}_{n_1 n_1} \right)^{-1} A\, \mathbf{r}_{xx}(D) \qquad (6.66)$$

where $\mathbf{r}_{xx}(D) = \mathcal{E}[x(m-D)\, \mathbf{x}(m)]$. The frequency-domain equivalent of 
Equation (6.66) can be derived as 

$$W(f) = \frac{P_{XX}(f)}{P_{XX}(f) + P_{N_1 N_1}(f)}\, A\, e^{-j\omega D} \qquad (6.67)$$



Note that in the absence of noise, the Wiener filter becomes a pure phase (or 
a pure delay) filter with a flat magnitude response. 
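
The noise-free case of Equation (6.67) can be demonstrated with the DFT. This is a minimal sketch, assuming NumPy, an integer delay, and circular (periodic) indexing so that the DFT shift property applies exactly; the values of `N`, `D` and `A` are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
N, D, A = 64, 5, 0.8
x = rng.standard_normal(N)                 # noise-free channel-1 signal

f = np.fft.fftfreq(N)                      # frequency in cycles per sample
W = A * np.exp(-2j * np.pi * f * D)        # Eq. (6.67) with zero noise power

aligned = np.fft.ifft(W * np.fft.fft(x)).real   # filter output
target = A * np.roll(x, D)                       # A x(m - D), circular indexing

print(np.max(np.abs(aligned - target)))    # zero to rounding error
```

The magnitude of `W` is the constant `A` at every frequency, so the filter only scales and delays its input.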
Figure 6.9 Configuration of a system for estimation of frequency Wiener filter. 



6.6.6 Implementation of Wiener Filters 

The implementation of a Wiener filter for additive noise reduction, using 
Equations (6.48)-(6.50), requires the autocorrelation functions, or 
equivalently the power spectra, of the signal and noise. The noise power 
spectrum can be obtained from the signal-inactive, noise-only, periods. The 
assumption is that the noise is quasi-stationary, and that its power spectrum 
remains relatively stationary between the update periods. This is a 
reasonable assumption for many noisy environments such as the noise 
inside a car emanating from the engine, aircraft noise, office noise from 
computer machines, etc. The main practical problem in the implementation 
of a Wiener filter is that the desired signal is often observed in noise, and 
that the autocorrelation or power spectrum of the desired signal is not readily 
available. Figure 6.9 illustrates the block-diagram configuration of a system 
for implementation of a Wiener filter for additive noise reduction. An 
estimate of the desired signal power spectrum is obtained by subtracting an 
estimate of the noise power spectrum from that of the noisy signal. A filter bank 
implementation of the Wiener filter is shown in Figure 6.10, where the 
incoming signal is divided into N bands of frequencies. A first-order 
integrator, placed at the output of each band-pass filter, gives an estimate of 
the power spectrum of the noisy signal. The power spectrum of the original 
signal is obtained by subtracting an estimate of the noise power spectrum 
from that of the noisy signal. In a Bayesian implementation of the Wiener filter, 
prior models of speech and noise, such as hidden Markov models, are used 
to obtain the power spectra of speech and noise required for calculation of 
the filter coefficients. 
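
The subtraction-based estimate described above might be sketched as follows. The code assumes NumPy, frame-averaged periodograms as the spectral estimator, a synthetic tone as the signal, and a zero floor on the subtracted spectrum; the function name `wiener_gain_from_estimates` and all numerical values are assumptions made for illustration, not the configuration of Figure 6.9.

```python
import numpy as np

def wiener_gain_from_estimates(P_yy, P_nn, floor=0.0):
    """Gain P_xx / (P_xx + P_nn), with P_xx estimated as max(P_yy - P_nn, floor)."""
    P_xx = np.maximum(P_yy - P_nn, floor)           # subtraction can go negative
    return P_xx / np.maximum(P_xx + P_nn, 1e-12)    # guard against 0/0

rng = np.random.default_rng(3)
nfft = 32
noise_frames = 0.3 * rng.standard_normal((50, nfft))     # signal-inactive period
tone = np.cos(2 * np.pi * 4 * np.arange(nfft) / nfft)    # hypothetical signal
noisy_frames = tone + 0.3 * rng.standard_normal((50, nfft))

# Power spectra estimated by averaging periodograms over frames
P_nn = np.mean(np.abs(np.fft.rfft(noise_frames, axis=1)) ** 2, axis=0)
P_yy = np.mean(np.abs(np.fft.rfft(noisy_frames, axis=1)) ** 2, axis=0)

W = wiener_gain_from_estimates(P_yy, P_nn)
print(np.round(W, 2))    # close to 1 at the tone bin (index 4), small elsewhere
```

The floor is needed because the difference of two noisy spectral estimates can be negative in bins that contain little or no signal.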
Figure 6.10 A filter-bank implementation of a Wiener filter. 



6.7 The Choice of Wiener Filter Order 

The choice of Wiener filter order affects: 

(a) the ability of the filter to remove distortions and reduce the noise; 

(b) the computational complexity of the filter; and 

(c) the numerical stability of the Wiener solution, Equation (6.10). 

The choice of the filter length also depends on the application and the 
method of implementation of the Wiener filter. For example, in a filter-bank 
implementation of the Wiener filter for additive noise reduction, the number 
of filter coefficients is equal to the number of filter banks, and typically the 
number of filter banks is between 16 and 64. On the other hand, for many 
applications, a direct implementation of the time-domain Wiener filter 
requires a larger filter length, say between 64 and 256 taps. 

A reduction in the required length of a time-domain Wiener filter can 
be achieved by dividing the time domain signal into N sub-band signals. 
Each sub-band signal can then be decimated by a factor of N. The 
decimation results in a reduction, by a factor of N, in the required length of 
each sub-band Wiener filter. In Chapter 14, a subband echo canceller is 
described. 



6.8 Summary 

A Wiener filter is formulated to map an input signal to an output that is as 
close to a desired signal as possible. This chapter began with the derivation 
of the least square error Wiener filter. In Section 6.2, we derived the block- 
data least square error Wiener filter for applications where only finite- 
length realisations of the input and the desired signals are available. In such 
cases, the filter is obtained by minimising a time-averaged squared error 
function. In Section 6.3, we considered a vector space interpretation of the 
Wiener filters as the perpendicular projection of the desired signal onto the 
space of the input signal. 

In Section 6.4, the least mean square error signal was analysed. The 
mean square error is zero only if the input signal is related to the desired 
signal through a linear and invertible filter. For most cases, owing to noise 
and/or nonlinear distortions of the input signal, the minimum mean square 
error would be non-zero. In Section 6.5, we derived the Wiener filter in the 
frequency domain, and considered the issue of separability of signal and 
noise using a linear filter. Finally in Section 6.6, we considered some 
applications of Wiener filters in noise reduction, time-delay estimation and 
channel equalisation. 
ADAPTIVE FILTERS 



7.1 State-Space Kalman Filters 

7.2 Sample-Adaptive Filters 

7.3 Recursive Least Square (RLS) Adaptive Filters 

7.4 The Steepest-Descent Method 

7.5 The LMS Filter 

7.6 Summary 



Adaptive filters are used for non-stationary signals and 
environments, or in applications where a sample-by-sample 
adaptation of a process or a low processing delay is required. 
Applications of adaptive filters include multichannel noise reduction, 
radar/sonar signal processing, channel equalization for cellular mobile 
phones, echo cancellation, and low delay speech coding. This chapter 
begins with a study of the state-space Kalman filter. In Kalman theory a 
state equation models the dynamics of the signal generation process, and an 
observation equation models the channel distortion and additive noise. 
Then we consider recursive least square (RLS) error adaptive filters. The 
RLS filter is a sample-adaptive formulation of the Wiener filter, and for 
stationary signals should converge to the same solution as the Wiener filter. 
In least square error filtering, an alternative to using a Wiener-type closed- 
form solution is an iterative gradient-based search for the optimal filter 
coefficients. The steepest-descent search is a gradient-based method for 
searching the least square error performance curve for the minimum error 
filter coefficients. We study the steepest-descent method, and then consider 
the computationally inexpensive LMS gradient search method. 
7.1 State-Space Kalman Filters 

The Kalman filter is a recursive least square error method for estimation of 
a signal distorted in transmission through a channel and observed in noise. 
Kalman filters can be used with time-varying as well as time-invariant 
processes. Kalman filter theory is based on a state-space approach in which 
a state equation models the dynamics of the signal process and an 
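observation equation models the noisy observation signal. For a signal $\mathbf{x}(m)$ 
and noisy observation $\mathbf{y}(m)$, the state equation model and the observation 
model are defined as 

$$\mathbf{x}(m) = \boldsymbol{\Phi}(m, m-1)\, \mathbf{x}(m-1) + \mathbf{e}(m) \qquad (7.1)$$

$$\mathbf{y}(m) = \mathbf{H}(m)\, \mathbf{x}(m) + \mathbf{n}(m) \qquad (7.2)$$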

where 

$\mathbf{x}(m)$ is the P-dimensional signal, or the state parameter, vector at time m, 
$\boldsymbol{\Phi}(m, m-1)$ is a P × P dimensional state transition matrix that relates the 
states of the process at times m-1 and m, 
$\mathbf{e}(m)$ is the P-dimensional uncorrelated input excitation vector of the state 
equation, 
$\boldsymbol{\Sigma}_{ee}(m)$ is the P × P covariance matrix of $\mathbf{e}(m)$, 
$\mathbf{y}(m)$ is the M-dimensional noisy and distorted observation vector, 
$\mathbf{H}(m)$ is the M × P channel distortion matrix, 
$\mathbf{n}(m)$ is the M-dimensional additive noise process, 
$\boldsymbol{\Sigma}_{nn}(m)$ is the M × M covariance matrix of $\mathbf{n}(m)$. 

The Kalman filter can be derived as a recursive minimum mean square 
error predictor of a signal $\mathbf{x}(m)$, given an observation signal $\mathbf{y}(m)$. The filter 
derivation assumes that the state transition matrix $\boldsymbol{\Phi}(m, m-1)$, the channel 
distortion matrix $\mathbf{H}(m)$, the covariance matrix $\boldsymbol{\Sigma}_{ee}(m)$ of the state equation 
input and the covariance matrix $\boldsymbol{\Sigma}_{nn}(m)$ of the additive noise are given. 

In this chapter, we use the notation $\hat{\mathbf{y}}(m|m-i)$ to denote a prediction of 
$\mathbf{y}(m)$ based on the observation samples up to the time m-i. Now assume that 
$\hat{\mathbf{y}}(m|m-1)$ is the least square error prediction of $\mathbf{y}(m)$ based on the 
observations $[\mathbf{y}(0), \ldots, \mathbf{y}(m-1)]$. Define a so-called innovation, or prediction 
error signal as 

$$\mathbf{v}(m) = \mathbf{y}(m) - \hat{\mathbf{y}}(m|m-1) \qquad (7.3)$$
Figure 7.1 Illustration of signal and observation models in Kalman filter theory. 



The innovation signal vector $\mathbf{v}(m)$ contains all that is unpredictable from the 
past observations, including both the noise and the unpredictable part of the 
signal. For an optimal linear least mean square error estimate, the 
innovation signal must be uncorrelated and orthogonal to the past 
observation vectors; hence we have 

$$\mathcal{E}[\mathbf{v}(m)\, \mathbf{y}^T(m-k)] = \mathbf{0}, \quad k > 0 \qquad (7.4)$$

and 

$$\mathcal{E}[\mathbf{v}(m)\, \mathbf{v}^T(k)] = \mathbf{0}, \quad m \neq k \qquad (7.5)$$



The concept of innovations is central to the derivation of the Kalman filter. 
The least square error criterion is satisfied if the estimation error is 
orthogonal to the past samples. In the following derivation of the Kalman 
filter, the orthogonality condition of Equation (7.4) is used as the starting 
point to derive an optimal linear filter whose innovations are orthogonal to 
the past observations. 

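Substituting the observation Equation (7.2) in Equation (7.3) and using 
the relation 

$$\hat{\mathbf{y}}(m|m-1) = \mathcal{E}[\mathbf{y}(m) \,|\, \hat{\mathbf{x}}(m|m-1)] = \mathbf{H}(m)\, \hat{\mathbf{x}}(m|m-1) \qquad (7.6)$$

yields 

$$\mathbf{v}(m) = \mathbf{H}(m)\mathbf{x}(m) + \mathbf{n}(m) - \mathbf{H}(m)\hat{\mathbf{x}}(m|m-1) = \mathbf{H}(m)\tilde{\mathbf{x}}(m) + \mathbf{n}(m) \qquad (7.7)$$

where $\tilde{\mathbf{x}}(m)$ is the signal prediction error vector defined as 

$$\tilde{\mathbf{x}}(m) = \mathbf{x}(m) - \hat{\mathbf{x}}(m|m-1) \qquad (7.8)$$

From Equation (7.7) the covariance matrix of the innovation signal is given 
by 

$$\boldsymbol{\Sigma}_{vv}(m) = \mathcal{E}[\mathbf{v}(m)\mathbf{v}^T(m)] = \mathbf{H}(m)\, \boldsymbol{\Sigma}_{\tilde{x}\tilde{x}}(m)\, \mathbf{H}^T(m) + \boldsymbol{\Sigma}_{nn}(m) \qquad (7.9)$$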



where $\boldsymbol{\Sigma}_{\tilde{x}\tilde{x}}(m)$ is the covariance matrix of the prediction error $\tilde{\mathbf{x}}(m)$. Let 
$\hat{\mathbf{x}}(m+1|m)$ denote the least square error prediction of the signal $\mathbf{x}(m+1)$. 
Now, the prediction of $\mathbf{x}(m+1)$, based on the samples available up to the 
time m, can be expressed recursively as a linear combination of the 
prediction based on the samples available up to the time m-1 and the 
innovation signal at time m as 

$$\hat{\mathbf{x}}(m+1|m) = \hat{\mathbf{x}}(m+1|m-1) + \mathbf{K}(m)\mathbf{v}(m) \qquad (7.10)$$

where the P × M matrix $\mathbf{K}(m)$ is the Kalman gain matrix. Now, from 
Equation (7.1), we have 

$$\hat{\mathbf{x}}(m+1|m-1) = \boldsymbol{\Phi}(m+1, m)\, \hat{\mathbf{x}}(m|m-1) \qquad (7.11)$$

Substituting Equation (7.11) in (7.10) gives a recursive prediction equation 
as 

$$\hat{\mathbf{x}}(m+1|m) = \boldsymbol{\Phi}(m+1, m)\, \hat{\mathbf{x}}(m|m-1) + \mathbf{K}(m)\mathbf{v}(m) \qquad (7.12)$$



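To obtain a recursive relation for the computation and update of the 
Kalman gain matrix, we multiply both sides of Equation (7.12) by $\mathbf{v}^T(m)$ 
and take the expectation of the results to yield 

$$\mathcal{E}[\hat{\mathbf{x}}(m+1|m)\, \mathbf{v}^T(m)] = \mathcal{E}[\boldsymbol{\Phi}(m+1, m)\, \hat{\mathbf{x}}(m|m-1)\, \mathbf{v}^T(m)] + \mathbf{K}(m)\, \mathcal{E}[\mathbf{v}(m)\mathbf{v}^T(m)] \qquad (7.13)$$

Owing to the required orthogonality of the innovation sequence and the past 
samples, we have 

$$\mathcal{E}[\hat{\mathbf{x}}(m|m-1)\, \mathbf{v}^T(m)] = \mathbf{0} \qquad (7.14)$$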



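Hence, from Equations (7.13) and (7.14), the Kalman gain matrix is given 
by 

$$\mathbf{K}(m) = \mathcal{E}[\hat{\mathbf{x}}(m+1|m)\, \mathbf{v}^T(m)]\; \boldsymbol{\Sigma}_{vv}^{-1}(m) \qquad (7.15)$$

The first term on the right-hand side of Equation (7.15) can be expressed as 

$$\begin{aligned}
\mathcal{E}[\hat{\mathbf{x}}(m+1|m)\, \mathbf{v}^T(m)] &= \mathcal{E}[(\mathbf{x}(m+1) - \tilde{\mathbf{x}}(m+1|m))\, \mathbf{v}^T(m)] \\
&= \mathcal{E}[\mathbf{x}(m+1)\, \mathbf{v}^T(m)] \\
&= \mathcal{E}[(\boldsymbol{\Phi}(m+1, m)\mathbf{x}(m) + \mathbf{e}(m+1))(\mathbf{y}(m) - \hat{\mathbf{y}}(m|m-1))^T] \\
&= \mathcal{E}[[\boldsymbol{\Phi}(m+1, m)(\hat{\mathbf{x}}(m|m-1) + \tilde{\mathbf{x}}(m|m-1))](\mathbf{H}(m)\tilde{\mathbf{x}}(m|m-1) + \mathbf{n}(m))^T] \\
&= \boldsymbol{\Phi}(m+1, m)\, \mathcal{E}[\tilde{\mathbf{x}}(m|m-1)\, \tilde{\mathbf{x}}^T(m|m-1)]\, \mathbf{H}^T(m)
\end{aligned} \qquad (7.16)$$

In developing the successive lines of Equation (7.16), we have used the 
following relations: 

$$\mathcal{E}[\tilde{\mathbf{x}}(m+1|m)\, \mathbf{v}^T(m)] = \mathbf{0} \qquad (7.17)$$

$$\mathcal{E}[\mathbf{e}(m+1)(\mathbf{y}(m) - \hat{\mathbf{y}}(m|m-1))^T] = \mathbf{0} \qquad (7.18)$$

$$\mathbf{x}(m) = \hat{\mathbf{x}}(m|m-1) + \tilde{\mathbf{x}}(m|m-1) \qquad (7.19)$$

$$\mathcal{E}[\hat{\mathbf{x}}(m|m-1)\, \tilde{\mathbf{x}}^T(m|m-1)] = \mathbf{0} \qquad (7.20)$$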



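and we have also used the assumption that the signal and the noise are 
uncorrelated. Substitution of Equations (7.9) and (7.16) in Equation (7.15) 
yields the following equation for the Kalman gain matrix: 

$$\mathbf{K}(m) = \boldsymbol{\Phi}(m+1, m)\, \boldsymbol{\Sigma}_{\tilde{x}\tilde{x}}(m)\, \mathbf{H}^T(m) \left[ \mathbf{H}(m)\, \boldsymbol{\Sigma}_{\tilde{x}\tilde{x}}(m)\, \mathbf{H}^T(m) + \boldsymbol{\Sigma}_{nn}(m) \right]^{-1} \qquad (7.21)$$

where $\boldsymbol{\Sigma}_{\tilde{x}\tilde{x}}(m)$ is the covariance matrix of the signal prediction error 
$\tilde{\mathbf{x}}(m|m-1)$. To derive a recursive relation for $\boldsymbol{\Sigma}_{\tilde{x}\tilde{x}}(m)$, we consider 

$$\tilde{\mathbf{x}}(m|m-1) = \mathbf{x}(m) - \hat{\mathbf{x}}(m|m-1) \qquad (7.22)$$

Substitution of Equation (7.1) and (7.12) in Equation (7.22) and 
rearrangement of the terms yields 

$$\begin{aligned}
\tilde{\mathbf{x}}(m|m-1) &= [\boldsymbol{\Phi}(m, m-1)\mathbf{x}(m-1) + \mathbf{e}(m)] - [\boldsymbol{\Phi}(m, m-1)\hat{\mathbf{x}}(m-1|m-2) + \mathbf{K}(m-1)\mathbf{v}(m-1)] \\
&= \boldsymbol{\Phi}(m, m-1)\tilde{\mathbf{x}}(m-1) + \mathbf{e}(m) - \mathbf{K}(m-1)\mathbf{H}(m-1)\tilde{\mathbf{x}}(m-1) - \mathbf{K}(m-1)\mathbf{n}(m-1) \\
&= [\boldsymbol{\Phi}(m, m-1) - \mathbf{K}(m-1)\mathbf{H}(m-1)]\, \tilde{\mathbf{x}}(m-1) + \mathbf{e}(m) - \mathbf{K}(m-1)\mathbf{n}(m-1)
\end{aligned} \qquad (7.23)$$

Here the noise term carries a minus sign because, from Equation (7.7), 
$\mathbf{v}(m-1) = \mathbf{H}(m-1)\tilde{\mathbf{x}}(m-1) + \mathbf{n}(m-1)$; the sign has no effect on the covariance 
that follows. From Equation (7.23) we can derive the following recursive relation for the 
variance of the signal prediction error 

$$\boldsymbol{\Sigma}_{\tilde{x}\tilde{x}}(m) = \mathbf{L}(m)\, \boldsymbol{\Sigma}_{\tilde{x}\tilde{x}}(m-1)\, \mathbf{L}^T(m) + \boldsymbol{\Sigma}_{ee}(m) + \mathbf{K}(m-1)\, \boldsymbol{\Sigma}_{nn}(m-1)\, \mathbf{K}^T(m-1) \qquad (7.24)$$

where the P × P matrix $\mathbf{L}(m)$ is defined as 

$$\mathbf{L}(m) = [\boldsymbol{\Phi}(m, m-1) - \mathbf{K}(m-1)\mathbf{H}(m-1)] \qquad (7.25)$$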



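Kalman Filtering Algorithm 

Input: observation vectors $\{\mathbf{y}(m)\}$ 
Output: state or signal vectors $\{\hat{\mathbf{x}}(m)\}$ 

Initial conditions: 

$$\boldsymbol{\Sigma}_{\tilde{x}\tilde{x}}(0) = \delta \mathbf{I} \qquad (7.26)$$

$$\hat{\mathbf{x}}(0|-1) = \mathbf{0} \qquad (7.27)$$

For m = 0, 1, ... 

Innovation signal: 

$$\mathbf{v}(m) = \mathbf{y}(m) - \mathbf{H}(m)\, \hat{\mathbf{x}}(m|m-1) \qquad (7.28)$$

Kalman gain: 

$$\mathbf{K}(m) = \boldsymbol{\Phi}(m+1, m)\, \boldsymbol{\Sigma}_{\tilde{x}\tilde{x}}(m)\, \mathbf{H}^T(m) \left[ \mathbf{H}(m)\, \boldsymbol{\Sigma}_{\tilde{x}\tilde{x}}(m)\, \mathbf{H}^T(m) + \boldsymbol{\Sigma}_{nn}(m) \right]^{-1} \qquad (7.29)$$

Prediction update: 

$$\hat{\mathbf{x}}(m+1|m) = \boldsymbol{\Phi}(m+1, m)\, \hat{\mathbf{x}}(m|m-1) + \mathbf{K}(m)\mathbf{v}(m) \qquad (7.30)$$

Prediction error correlation matrix update: 

$$\mathbf{L}(m+1) = [\boldsymbol{\Phi}(m+1, m) - \mathbf{K}(m)\mathbf{H}(m)] \qquad (7.31)$$

$$\boldsymbol{\Sigma}_{\tilde{x}\tilde{x}}(m+1) = \mathbf{L}(m+1)\, \boldsymbol{\Sigma}_{\tilde{x}\tilde{x}}(m)\, \mathbf{L}^T(m+1) + \boldsymbol{\Sigma}_{ee}(m+1) + \mathbf{K}(m)\, \boldsymbol{\Sigma}_{nn}(m)\, \mathbf{K}^T(m) \qquad (7.32)$$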
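
The recursion of Equations (7.28)-(7.32) might be sketched in code as follows. The sketch assumes NumPy and time-invariant model matrices; the function name `kalman_predict`, the argument names `Q` and `R` (standing for the excitation and noise covariance matrices) and the constant-velocity test model are illustrative assumptions, not part of the text.

```python
import numpy as np

def kalman_predict(y, Phi, H, Q, R, delta=1.0):
    """One-step Kalman predictor for time-invariant Phi, H, Q and R.

    y has shape (num_samples, M). Returns the predictions x_hat(m+1|m),
    one row per m, and the final prediction error covariance.
    """
    dim = Phi.shape[0]
    x_pred = np.zeros(dim)                  # x_hat(0|-1) = 0
    S = delta * np.eye(dim)                 # initial error covariance, delta * I
    out = []
    for y_m in y:
        v = y_m - H @ x_pred                                  # Eq. (7.28)
        K = Phi @ S @ H.T @ np.linalg.inv(H @ S @ H.T + R)    # Eq. (7.29)
        x_pred = Phi @ x_pred + K @ v                         # Eq. (7.30)
        L = Phi - K @ H                                       # Eq. (7.31)
        S = L @ S @ L.T + Q + K @ R @ K.T                     # Eq. (7.32)
        out.append(x_pred.copy())
    return np.array(out), S

# Hypothetical example: position/velocity state, noise-free ramp observation.
Phi = np.array([[1.0, 1.0], [0.0, 1.0]])
H = np.array([[1.0, 0.0]])
Q = 0.01 * np.eye(2)
R = np.array([[0.25]])
y = np.arange(300, dtype=float).reshape(-1, 1)    # y(m) = m

preds, S = kalman_predict(y, Phi, H, Q, R)
print(preds[-1])     # predicted [position, velocity] for m = 300
```

For time-varying models the matrices would be indexed by m inside the loop; the fixed matrices here only keep the example short.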

Example 7.1 Consider the Kalman filtering of a first-order AR process 
$x(m)$ observed in an additive white Gaussian noise $n(m)$. Assume that the 
signal generation and the observation equations are given as 

$$x(m) = a(m)\, x(m-1) + e(m) \qquad (7.33)$$

$$y(m) = x(m) + n(m) \qquad (7.34)$$

Let $\sigma_e^2(m)$ and $\sigma_n^2(m)$ denote the variances of the excitation signal $e(m)$ 
and the noise $n(m)$ respectively. Substituting $\Phi(m+1, m) = a(m+1)$ and $H(m) = 1$ 
in the Kalman filter equations yields the following Kalman filter algorithm: 

Initial conditions: 

$$\sigma_{\tilde{x}}^2(0) = \delta \qquad (7.35)$$

$$\hat{x}(0|-1) = 0 \qquad (7.36)$$

For m = 0, 1, ... 

Kalman gain: 

$$k(m) = \frac{a(m+1)\, \sigma_{\tilde{x}}^2(m)}{\sigma_{\tilde{x}}^2(m) + \sigma_n^2(m)} \qquad (7.37)$$

Innovation signal: 

$$v(m) = y(m) - \hat{x}(m|m-1) \qquad (7.38)$$

Prediction signal update: 

$$\hat{x}(m+1|m) = a(m+1)\, \hat{x}(m|m-1) + k(m)\, v(m) \qquad (7.39)$$

Prediction error update: 

$$\sigma_{\tilde{x}}^2(m+1) = [a(m+1) - k(m)]^2\, \sigma_{\tilde{x}}^2(m) + \sigma_e^2(m+1) + k^2(m)\, \sigma_n^2(m) \qquad (7.40)$$

where $\sigma_{\tilde{x}}^2(m)$ is the variance of the prediction error signal. 

Example 7.2 Recursive estimation of a constant signal observed in noise. 
Consider the estimation of a constant signal observed in a random noise. 
The state and observation equations for this problem are given by 

$$x(m) = x(m-1) = x \qquad (7.41)$$

$$y(m) = x + n(m) \qquad (7.42)$$

Note that $\Phi(m, m-1) = 1$, state excitation $e(m) = 0$ and $H(m) = 1$. Using the 
Kalman algorithm, we have the following recursive solutions: 

Initial conditions: 

$$\sigma_{\tilde{x}}^2(0) = \delta \qquad (7.43)$$

$$\hat{x}(0|-1) = 0 \qquad (7.44)$$

For m = 0, 1, ... 

Kalman gain: 

$$k(m) = \frac{\sigma_{\tilde{x}}^2(m)}{\sigma_{\tilde{x}}^2(m) + \sigma_n^2(m)} \qquad (7.45)$$

Innovation signal: 

$$v(m) = y(m) - \hat{x}(m|m-1) \qquad (7.46)$$

Prediction signal update: 

$$\hat{x}(m+1|m) = \hat{x}(m|m-1) + k(m)\, v(m) \qquad (7.47)$$

Prediction error update: 

$$\sigma_{\tilde{x}}^2(m+1) = [1 - k(m)]^2\, \sigma_{\tilde{x}}^2(m) + k^2(m)\, \sigma_n^2(m) \qquad (7.48)$$
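
The recursion of Example 7.2 is short enough to run directly. The sketch below assumes NumPy, a constant noise variance and synthetic data; the function name `estimate_constant` and the chosen values are illustrative. With a large initial variance δ the estimate is practically the running mean of the observations.

```python
import numpy as np

def estimate_constant(y, noise_var, delta):
    """Scalar Kalman recursion of Eqs. (7.43)-(7.48) for a constant signal."""
    x_pred, s = 0.0, delta                 # x_hat(0|-1) = 0, initial variance
    for y_m in y:
        k = s / (s + noise_var)            # gain, Eq. (7.45)
        v = y_m - x_pred                   # innovation, Eq. (7.46)
        x_pred = x_pred + k * v            # prediction update, Eq. (7.47)
        s = (1 - k) ** 2 * s + k**2 * noise_var   # error variance, Eq. (7.48)
    return x_pred, s

rng = np.random.default_rng(4)
y = 3.0 + rng.standard_normal(500)         # constant 3.0 in unit-variance noise
x_hat, s = estimate_constant(y, noise_var=1.0, delta=1e6)

print(x_hat, y.mean())     # nearly identical for large delta
print(s, 1.0 / len(y))     # error variance falls roughly as 1/N
```

Equation (7.48) with the gain of Equation (7.45) simplifies to $1/\sigma_{\tilde{x}}^2(m+1) = 1/\sigma_{\tilde{x}}^2(m) + 1/\sigma_n^2$, which is why the error variance after N samples is close to $\sigma_n^2/N$.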



7.2 Sample-Adaptive Filters 

Sample adaptive filters, namely the RLS, the steepest descent and the LMS, 
are recursive formulations of the least square error Wiener filter. Sample- 
adaptive filters have a number of advantages over the block-adaptive filters 
of Chapter 6, including lower processing delay and better tracking of non- 
stationary signals. These are essential characteristics in applications such as 
echo cancellation, adaptive delay estimation, low-delay predictive coding, 
noise cancellation, radar, and channel equalisation in mobile telephony, 
where low delay and fast tracking of time-varying processes and 
environments are important objectives. 

Figure 7.2 illustrates the configuration of a least square error adaptive 
filter. At each sampling time, an adaptation algorithm adjusts the filter 
coefficients to minimise the difference between the filter output and a 
desired, or target, signal. An adaptive filter starts at some initial state, and 
then the filter coefficients are periodically updated, usually on a sample-by- 
sample basis, to minimise the difference between the filter output and a 
desired or target signal. The adaptation formula has the general recursive 
form: 

next parameter estimate = previous parameter estimate + update(error ) 

where the update term is a function of the error signal. In adaptive filtering a 
number of decisions have to be made concerning the filter model and the 
adaptation algorithm: 

(a) Filter type: This can be a finite impulse response (FIR) filter, or an 
infinite impulse response (IIR) filter. In this chapter we only consider 
FIR filters, since they have good stability and convergence properties 
and for this reason are the type most often used in practice. 

(b) Filter order: Often the correct number of filter taps is unknown. The 
filter order is either set using a priori knowledge of the input and the 
desired signals, or it may be obtained by monitoring the changes in the 
error signal as a function of the increasing filter order. 

(c) Adaptation algorithm: The two most widely used adaptation algorithms 
are the recursive least square (RLS) error and the least mean square 
error (LMS) methods. The factors that influence the choice of the 
adaptation algorithm are the computational complexity, the speed of 
convergence to optimal operating condition, the minimum error at 
convergence, the numerical stability and the robustness of the algorithm 
to initial parameter states. 
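
The general recursive form "next estimate = previous estimate + update(error)" can be made concrete with a short FIR example. The sketch assumes NumPy and uses an LMS-type update, one of the adaptation algorithms named in item (c), chosen here only because it is the shortest to write; the function name `adapt_fir`, the step size `mu` and the two-tap system are hypothetical.

```python
import numpy as np

def adapt_fir(y, x, num_taps, mu):
    """Sample-by-sample adaptation of an FIR filter with an LMS-type update."""
    w = np.zeros(num_taps)                 # initial filter state
    buf = np.zeros(num_taps)               # [y(m), y(m-1), ..., y(m-P+1)]
    for y_m, x_m in zip(y, x):
        buf = np.roll(buf, 1)
        buf[0] = y_m
        e = x_m - w @ buf                  # error: target minus filter output
        w = w + mu * e * buf               # previous estimate + update(error)
    return w

rng = np.random.default_rng(5)
y = rng.standard_normal(5000)              # filter input
true_w = np.array([0.7, -0.3])             # hypothetical system to identify
x = np.convolve(y, true_w)[: len(y)]       # noise-free target signal

w = adapt_fir(y, x, num_taps=2, mu=0.05)
print(w)                                   # approaches [0.7, -0.3]
```

The step size trades convergence speed against stability and residual error; the value used here is simply one that is small relative to the input power.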



7.3 Recursive Least Square (RLS) Adaptive Filters 

The recursive least square error (RLS) filter is a sample-adaptive, time- 
update, version of the Wiener filter studied in Chapter 6. For stationary 
signals, the RLS filter converges to the same optimal filter coefficients as 
the Wiener filter. For non-stationary signals, the RLS filter tracks the time 
variations of the process. The RLS filter has a relatively fast rate of 
convergence to the optimal filter coefficients. This is useful in applications 
such as speech enhancement, channel equalization, echo cancellation and 
radar where the filter should be able to track relatively fast changes in the 
signal process. 

In the recursive least square algorithm, the adaptation starts with some 
initial filter state, and successive samples of the input signals are used to 
adapt the filter coefficients. Figure 7.2 illustrates the configuration of an 
adaptive filter where y(m), x(m) and $\mathbf{w}(m)=[w_0(m), w_1(m), \ldots, w_{P-1}(m)]$ 
denote the filter input, the desired signal and the filter coefficient vector 
respectively. The filter output can be expressed as 

$$\hat{x}(m) = \mathbf{w}^{T}(m)\,\mathbf{y}(m) \tag{7.49}$$

where $\hat{x}(m)$ is an estimate of the desired signal x(m). The filter error signal 
is defined as 

$$\begin{aligned}
e(m) &= x(m) - \hat{x}(m) \\
     &= x(m) - \mathbf{w}^{T}(m)\,\mathbf{y}(m)
\end{aligned} \tag{7.50}$$

Figure 7.2 Illustration of the configuration of an adaptive filter. 



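The adaptation process is based on the minimization of the mean square 
error criterion defined as 

$$\begin{aligned}
\mathcal{E}[e^2(m)] &= \mathcal{E}\left[\left(x(m) - \mathbf{w}^{T}(m)\,\mathbf{y}(m)\right)^2\right] \\
&= \mathcal{E}[x^2(m)] - 2\mathbf{w}^{T}(m)\,\mathcal{E}[\mathbf{y}(m)x(m)] + \mathbf{w}^{T}(m)\,\mathcal{E}[\mathbf{y}(m)\mathbf{y}^{T}(m)]\,\mathbf{w}(m) \\
&= r_{xx}(0) - 2\mathbf{w}^{T}(m)\,\mathbf{r}_{yx}(m) + \mathbf{w}^{T}(m)\,\mathbf{R}_{yy}(m)\,\mathbf{w}(m)
\end{aligned} \tag{7.51}$$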

The Wiener filter is obtained by minimising the mean square error with 
respect to the filter coefficients. For stationary signals, the result of this 
minimisation is given in Chapter 6, Equation (6.10), as 



$$\mathbf{w} = \mathbf{R}_{yy}^{-1}\,\mathbf{r}_{yx} \tag{7.52}$$

where $\mathbf{R}_{yy}$ is the autocorrelation matrix of the input signal and $\mathbf{r}_{yx}$ is the 
cross-correlation vector of the input and the target signals. In the following, 
we formulate a recursive, time-update, adaptive formulation of Equation 
(7.52). From Section 6.2, for a block of N sample vectors, the correlation 
matrix can be written as 

$$\mathbf{R}_{yy} = \mathbf{Y}^{T}\mathbf{Y} = \sum_{m=0}^{N-1} \mathbf{y}(m)\,\mathbf{y}^{T}(m) \tag{7.53}$$

where $\mathbf{y}(m)=[y(m), \ldots, y(m-P)]^{T}$. Now, the sum of vector products in 
Equation (7.53) can be expressed in recursive fashion as 

$$\mathbf{R}_{yy}(m) = \mathbf{R}_{yy}(m-1) + \mathbf{y}(m)\,\mathbf{y}^{T}(m) \tag{7.54}$$

To introduce adaptability to the time variations of the signal statistics, the 
autocorrelation estimate in Equation (7.54) can be windowed by an 
exponentially decaying window: 

$$\mathbf{R}_{yy}(m) = \lambda\,\mathbf{R}_{yy}(m-1) + \mathbf{y}(m)\,\mathbf{y}^{T}(m) \tag{7.55}$$

where $\lambda$ is the so-called adaptation, or forgetting, factor, and is in the range 
$0 < \lambda < 1$. Similarly, the cross-correlation vector is given by 

$$\mathbf{r}_{yx} = \sum_{m=0}^{N-1} \mathbf{y}(m)\,x(m) \tag{7.56}$$

The sum of products in Equation (7.56) can be calculated in recursive form 
as 

$$\mathbf{r}_{yx}(m) = \mathbf{r}_{yx}(m-1) + \mathbf{y}(m)\,x(m) \tag{7.57}$$

Again this equation can be made adaptive using an exponentially decaying 
forgetting factor $\lambda$: 

$$\mathbf{r}_{yx}(m) = \lambda\,\mathbf{r}_{yx}(m-1) + \mathbf{y}(m)\,x(m) \tag{7.58}$$

For a recursive solution of the least square error Equation (7.52), we need to 
obtain a recursive time-update formula for the inverse matrix in the form 

$$\mathbf{R}_{yy}^{-1}(m) = \mathbf{R}_{yy}^{-1}(m-1) + \mathrm{Update}(m) \tag{7.59}$$

A recursive relation for the matrix inversion is obtained using the following 
lemma. 

The Matrix Inversion Lemma Let A and B be two positive-definite 
P×P matrices related by 

$$\mathbf{A} = \mathbf{B}^{-1} + \mathbf{C}\mathbf{D}^{-1}\mathbf{C}^{T} \tag{7.60}$$

where D is a positive-definite N×N matrix and C is a P×N matrix. The 
matrix inversion lemma states that the inverse of the matrix A can be 
expressed as 

$$\mathbf{A}^{-1} = \mathbf{B} - \mathbf{B}\mathbf{C}\left(\mathbf{D} + \mathbf{C}^{T}\mathbf{B}\mathbf{C}\right)^{-1}\mathbf{C}^{T}\mathbf{B} \tag{7.61}$$

This lemma can be proved by multiplying Equation (7.60) and Equation 
(7.61). The left and right hand sides of the results of multiplication are the 
identity matrix. The matrix inversion lemma can be used to obtain a 
recursive implementation for the inverse of the correlation matrix $\mathbf{R}_{yy}^{-1}(m)$. 
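
As a quick numerical sanity check of the lemma, the following sketch builds random positive-definite matrices and confirms that Equation (7.61) gives the inverse of the matrix A of Equation (7.60). The matrix sizes, the random seed and the helper name `lemma_inverse` are illustrative assumptions, not part of the text.

```python
import numpy as np

def lemma_inverse(B, C, D):
    """Inverse of A = B^{-1} + C D^{-1} C^T computed with Equation (7.61)."""
    inner = np.linalg.inv(D + C.T @ B @ C)
    return B - B @ C @ inner @ C.T @ B

rng = np.random.default_rng(0)
P, N = 4, 2                              # illustrative dimensions
M = rng.standard_normal((P, P))
B = M @ M.T + P * np.eye(P)              # positive-definite P x P matrix
K = rng.standard_normal((N, N))
D = K @ K.T + N * np.eye(N)              # positive-definite N x N matrix
C = rng.standard_normal((P, N))          # P x N matrix

A = np.linalg.inv(B) + C @ np.linalg.inv(D) @ C.T   # Equation (7.60)
lemma_ok = np.allclose(lemma_inverse(B, C, D) @ A, np.eye(P))
```

In the RLS case C is the single column y(m) and D is the 1×1 identity, so the inner inverse reduces to a scalar division, which is what makes the recursion cheap.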

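Let 

$$\mathbf{R}_{yy}(m) = \mathbf{A} \tag{7.62}$$

$$\lambda^{-1}\mathbf{R}_{yy}^{-1}(m-1) = \mathbf{B} \tag{7.63}$$

$$\mathbf{y}(m) = \mathbf{C} \tag{7.64}$$

$$\mathbf{D} = \text{identity matrix} \tag{7.65}$$

Substituting Equations (7.62)-(7.65) in Equation (7.61), we obtain 

$$\mathbf{R}_{yy}^{-1}(m) = \lambda^{-1}\mathbf{R}_{yy}^{-1}(m-1) - \frac{\lambda^{-2}\,\mathbf{R}_{yy}^{-1}(m-1)\,\mathbf{y}(m)\,\mathbf{y}^{T}(m)\,\mathbf{R}_{yy}^{-1}(m-1)}{1 + \lambda^{-1}\,\mathbf{y}^{T}(m)\,\mathbf{R}_{yy}^{-1}(m-1)\,\mathbf{y}(m)} \tag{7.66}$$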



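Now define the variables $\boldsymbol{\Phi}_{yy}(m)$ and $\mathbf{k}(m)$ as 

$$\boldsymbol{\Phi}_{yy}(m) = \mathbf{R}_{yy}^{-1}(m) \tag{7.67}$$

and 

$$\mathbf{k}(m) = \frac{\lambda^{-1}\,\mathbf{R}_{yy}^{-1}(m-1)\,\mathbf{y}(m)}{1 + \lambda^{-1}\,\mathbf{y}^{T}(m)\,\mathbf{R}_{yy}^{-1}(m-1)\,\mathbf{y}(m)} \tag{7.68}$$

or 

$$\mathbf{k}(m) = \frac{\lambda^{-1}\,\boldsymbol{\Phi}_{yy}(m-1)\,\mathbf{y}(m)}{1 + \lambda^{-1}\,\mathbf{y}^{T}(m)\,\boldsymbol{\Phi}_{yy}(m-1)\,\mathbf{y}(m)} \tag{7.69}$$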



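Using Equations (7.67) and (7.69), the recursive equation (7.66) for 
computing the inverse matrix can be written as 

$$\boldsymbol{\Phi}_{yy}(m) = \lambda^{-1}\boldsymbol{\Phi}_{yy}(m-1) - \lambda^{-1}\mathbf{k}(m)\,\mathbf{y}^{T}(m)\,\boldsymbol{\Phi}_{yy}(m-1) \tag{7.70}$$

From Equations (7.69) and (7.70), we have 

$$\begin{aligned}
\mathbf{k}(m) &= \left[\lambda^{-1}\boldsymbol{\Phi}_{yy}(m-1) - \lambda^{-1}\mathbf{k}(m)\,\mathbf{y}^{T}(m)\,\boldsymbol{\Phi}_{yy}(m-1)\right]\mathbf{y}(m) \\
&= \boldsymbol{\Phi}_{yy}(m)\,\mathbf{y}(m)
\end{aligned} \tag{7.71}$$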



Now Equations (7.70) and (7.71) are used in the following to derive the 
RLS adaptation algorithm. 



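Recursive Time-update of Filter Coefficients The least square error 
filter coefficients are 

$$\mathbf{w}(m) = \mathbf{R}_{yy}^{-1}(m)\,\mathbf{r}_{yx}(m) = \boldsymbol{\Phi}_{yy}(m)\,\mathbf{r}_{yx}(m) \tag{7.72}$$

Substituting the recursive form of the correlation vector in Equation (7.72) 
yields 

$$\begin{aligned}
\mathbf{w}(m) &= \boldsymbol{\Phi}_{yy}(m)\left[\lambda\,\mathbf{r}_{yx}(m-1) + \mathbf{y}(m)\,x(m)\right] \\
&= \lambda\,\boldsymbol{\Phi}_{yy}(m)\,\mathbf{r}_{yx}(m-1) + \boldsymbol{\Phi}_{yy}(m)\,\mathbf{y}(m)\,x(m)
\end{aligned} \tag{7.73}$$

Now substitution of the recursive form of the matrix $\boldsymbol{\Phi}_{yy}(m)$ from Equation 
(7.70) and $\mathbf{k}(m)=\boldsymbol{\Phi}_{yy}(m)\,\mathbf{y}(m)$ from Equation (7.71) in the right-hand side of 
Equation (7.73) yields 

$$\mathbf{w}(m) = \left[\lambda^{-1}\boldsymbol{\Phi}_{yy}(m-1) - \lambda^{-1}\mathbf{k}(m)\,\mathbf{y}^{T}(m)\,\boldsymbol{\Phi}_{yy}(m-1)\right]\lambda\,\mathbf{r}_{yx}(m-1) + \mathbf{k}(m)\,x(m) \tag{7.74}$$

or 

$$\mathbf{w}(m) = \boldsymbol{\Phi}_{yy}(m-1)\,\mathbf{r}_{yx}(m-1) - \mathbf{k}(m)\,\mathbf{y}^{T}(m)\,\boldsymbol{\Phi}_{yy}(m-1)\,\mathbf{r}_{yx}(m-1) + \mathbf{k}(m)\,x(m) \tag{7.75}$$

Substitution of $\mathbf{w}(m-1)=\boldsymbol{\Phi}_{yy}(m-1)\,\mathbf{r}_{yx}(m-1)$ in Equation (7.75) yields 

$$\mathbf{w}(m) = \mathbf{w}(m-1) + \mathbf{k}(m)\left[x(m) - \mathbf{y}^{T}(m)\,\mathbf{w}(m-1)\right] \tag{7.76}$$

The bracketed term is the error signal of Equation (7.50) evaluated with the 
previous coefficient vector w(m-1), so this equation can be rewritten in the 
following form 

$$\mathbf{w}(m) = \mathbf{w}(m-1) + \mathbf{k}(m)\,e(m) \tag{7.77}$$

Equation (7.77) is a recursive time-update implementation of the least 
square error Wiener filter. 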

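RLS Adaptation Algorithm 

Input signals: y(m) and x(m) 

Initial values: $\boldsymbol{\Phi}_{yy}(0) = \delta\,\mathbf{I}$ 

$\mathbf{w}(0) = \mathbf{w}_{I}$ 

For m = 1, 2, ... 

Filter gain vector: 

$$\mathbf{k}(m) = \frac{\lambda^{-1}\,\boldsymbol{\Phi}_{yy}(m-1)\,\mathbf{y}(m)}{1 + \lambda^{-1}\,\mathbf{y}^{T}(m)\,\boldsymbol{\Phi}_{yy}(m-1)\,\mathbf{y}(m)} \tag{7.78}$$

Error signal equation: 

$$e(m) = x(m) - \mathbf{w}^{T}(m-1)\,\mathbf{y}(m) \tag{7.79}$$

Filter coefficients: 

$$\mathbf{w}(m) = \mathbf{w}(m-1) + \mathbf{k}(m)\,e(m) \tag{7.80}$$

Inverse correlation matrix update: 

$$\boldsymbol{\Phi}_{yy}(m) = \lambda^{-1}\boldsymbol{\Phi}_{yy}(m-1) - \lambda^{-1}\mathbf{k}(m)\,\mathbf{y}^{T}(m)\,\boldsymbol{\Phi}_{yy}(m-1) \tag{7.81}$$

Figure 7.3 Illustration of gradient search of the mean square error surface for the 
minimum error point. 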
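
The RLS recursion of Equations (7.78)-(7.81) can be sketched in a few lines of NumPy. This is a minimal illustration rather than a production implementation: the function name `rls_filter`, the zero initial coefficient vector, the values of the forgetting factor and of δ, and the three-tap test system `h` are all assumptions made for the example.

```python
import numpy as np

def rls_filter(y, x, P, lam=0.99, delta=100.0):
    """Sketch of RLS adaptation, Equations (7.78)-(7.81)."""
    w = np.zeros(P)                  # assumed initial coefficients w(0) = 0
    Phi = delta * np.eye(P)          # initial inverse correlation matrix, delta * I
    errors = np.zeros(len(y))
    for m in range(P - 1, len(y)):
        y_vec = y[m - P + 1:m + 1][::-1]        # [y(m), ..., y(m-P+1)]
        Phi_y = Phi @ y_vec
        k = Phi_y / (lam + y_vec @ Phi_y)       # gain vector, Eq. (7.78) scaled by lam
        e = x[m] - w @ y_vec                    # error with previous coefficients, Eq. (7.79)
        w = w + k * e                           # coefficient update, Eq. (7.80)
        Phi = (Phi - np.outer(k, y_vec @ Phi)) / lam   # inverse matrix update, Eq. (7.81)
        errors[m] = e
    return w, errors

# Hypothetical system identification example: recover a 3-tap FIR response.
rng = np.random.default_rng(0)
h = np.array([0.5, -0.3, 0.2])
y = rng.standard_normal(500)
x = np.convolve(y, h)[:len(y)] + 0.01 * rng.standard_normal(len(y))
w_rls, e_rls = rls_filter(y, x, P=3)
```

Each iteration costs on the order of P² operations because of the matrix-vector products with the P×P matrix, compared with the order P cost of a gradient update.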



7.4 The Steepest-Descent Method 

The mean square error surface with respect to the coefficients of an FIR 
filter, is a quadratic bowl-shaped curve, with a single global minimum that 
corresponds to the LSE filter coefficients. Figure 7.3 illustrates the mean 
square error curve for a single coefficient filter. This figure also illustrates 
the steepest-descent search for the minimum mean square error coefficient. 
The search is based on taking a number of successive downward steps in 
the direction of negative gradient of the error surface. Starting with a set of 
initial values, the filter coefficients are successively updated in the 
downward direction, until the minimum point, at which the gradient is zero, 
is reached. The steepest-descent adaptation method can be expressed as 



$$\mathbf{w}(m+1) = \mathbf{w}(m) + \mu\left[-\frac{\partial\,\mathcal{E}[e^2(m)]}{\partial\,\mathbf{w}(m)}\right] \tag{7.82}$$

where $\mu$ is the adaptation step size. From Equation (7.51), for stationary 
signals the gradient of the mean square error function is given by 

$$\frac{\partial\,\mathcal{E}[e^2(m)]}{\partial\,\mathbf{w}(m)} = -2\,\mathbf{r}_{yx} + 2\,\mathbf{R}_{yy}\,\mathbf{w}(m) \tag{7.83}$$



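Substituting Equation (7.83) in Equation (7.82) yields 

$$\mathbf{w}(m+1) = \mathbf{w}(m) + \mu\left[\mathbf{r}_{yx} - \mathbf{R}_{yy}\,\mathbf{w}(m)\right] \tag{7.84}$$

where the factor of 2 in Equation (7.83) has been absorbed in the adaptation 
step size $\mu$. Let $\mathbf{w}_o$ denote the optimal LSE filter coefficient vector; we 
define a filter coefficients error vector $\tilde{\mathbf{w}}(m)$ as 

$$\tilde{\mathbf{w}}(m) = \mathbf{w}(m) - \mathbf{w}_o \tag{7.85}$$

For a stationary process, the optimal LSE filter $\mathbf{w}_o$ is obtained from the 
Wiener filter, Equation (6.10), as 

$$\mathbf{w}_o = \mathbf{R}_{yy}^{-1}\,\mathbf{r}_{yx} \tag{7.86}$$

Subtracting $\mathbf{w}_o$ from both sides of Equation (7.84), and then substituting 
$\mathbf{R}_{yy}\mathbf{w}_o$ for $\mathbf{r}_{yx}$, and using Equation (7.85) yields 

$$\tilde{\mathbf{w}}(m+1) = \left[\mathbf{I} - \mu\,\mathbf{R}_{yy}\right]\tilde{\mathbf{w}}(m) \tag{7.87}$$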

It is desirable that the filter error vector $\tilde{\mathbf{w}}(m)$ vanishes as rapidly as 
possible. The parameter $\mu$, the adaptation step size, controls the stability 
and the rate of convergence of the adaptive filter. Too large a value for $\mu$ 
causes instability; too small a value gives a low convergence rate. The 
stability of the parameter estimation method depends on the choice of the 
adaptation parameter $\mu$ and the autocorrelation matrix. From Equation 
(7.87), a recursive equation for the error in each individual filter coefficient 
can be obtained as follows. The correlation matrix can be expressed in 
terms of the matrices of eigenvectors and eigenvalues as 

$$\mathbf{R}_{yy} = \mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^{T} \tag{7.88}$$

where $\mathbf{Q}$ is an orthonormal matrix of the eigenvectors of $\mathbf{R}_{yy}$, and $\boldsymbol{\Lambda}$ is a 
diagonal matrix with its diagonal elements corresponding to the 
eigenvalues of $\mathbf{R}_{yy}$. Substituting $\mathbf{R}_{yy}$ from Equation (7.88) in Equation 
(7.87) yields 

$$\tilde{\mathbf{w}}(m+1) = \left[\mathbf{I} - \mu\,\mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^{T}\right]\tilde{\mathbf{w}}(m) \tag{7.89}$$

Multiplying both sides of Equation (7.89) by $\mathbf{Q}^{T}$ and using the relation 
$\mathbf{Q}^{T}\mathbf{Q}=\mathbf{Q}\mathbf{Q}^{T}=\mathbf{I}$ yields 

$$\mathbf{Q}^{T}\tilde{\mathbf{w}}(m+1) = \left[\mathbf{I} - \mu\boldsymbol{\Lambda}\right]\mathbf{Q}^{T}\tilde{\mathbf{w}}(m) \tag{7.90}$$

Let 

$$\mathbf{v}(m) = \mathbf{Q}^{T}\tilde{\mathbf{w}}(m) \tag{7.91}$$

Then 

$$\mathbf{v}(m+1) = \left[\mathbf{I} - \mu\boldsymbol{\Lambda}\right]\mathbf{v}(m) \tag{7.92}$$

As $\boldsymbol{\Lambda}$ and $\mathbf{I}$ are both diagonal matrices, Equation (7.92) can be expressed in 
terms of the equations for the individual elements of the error vector v(m) 
as 

$$v_k(m+1) = \left[1 - \mu\lambda_k\right]v_k(m) \tag{7.93}$$

where $\lambda_k$ is the kth eigenvalue of the autocorrelation matrix of the filter 
input y(m). Figure 7.4 is a feedback network model of the time variations of 
the error vector. From Equation (7.93), the condition for the stability of the 
adaptation process and the decay of the coefficient error vector is 

$$-1 < 1 - \mu\lambda_k < 1 \tag{7.94}$$

Figure 7.4 A feedback model of the variation of coefficient error with time. 

Let $\lambda_{\max}$ denote the maximum eigenvalue of the autocorrelation matrix of 
y(m); then, from Equation (7.94), the limits on $\mu$ for stable adaptation are 
given by 

$$0 < \mu < \frac{2}{\lambda_{\max}} \tag{7.95}$$
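
The stability limit of Equation (7.95) can be illustrated numerically. The sketch below iterates Equation (7.84) for a step size just inside and just outside the limit; the 2×2 correlation matrix, the cross-correlation vector and the function name `steepest_descent` are illustrative assumptions.

```python
import numpy as np

def steepest_descent(R, r, mu, n_iter):
    """Iterate Equation (7.84): w(m+1) = w(m) + mu * [r_yx - R_yy w(m)]."""
    w = np.zeros(len(r))
    for _ in range(n_iter):
        w = w + mu * (r - R @ w)
    return w

R = np.array([[1.0, 0.8], [0.8, 1.0]])    # assumed autocorrelation matrix
r = np.array([0.5, 0.2])                   # assumed cross-correlation vector
w_opt = np.linalg.solve(R, r)              # Wiener solution, Equation (7.86)
lam_max = np.linalg.eigvalsh(R).max()      # largest eigenvalue of R

# Step size at 90% of the limit 2/lam_max converges; 110% diverges.
w_stable = steepest_descent(R, r, mu=0.9 * 2 / lam_max, n_iter=2000)
w_unstable = steepest_descent(R, r, mu=1.1 * 2 / lam_max, n_iter=200)
```

With the larger step size the mode associated with the maximum eigenvalue has a factor 1 - μλ of magnitude greater than one, so its error grows at every iteration.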



Convergence Rate The convergence rate of the filter coefficients 
depends on the choice of the adaptation step size $\mu$, where $0 < \mu < 1/\lambda_{\max}$. 
When the eigenvalues of the correlation matrix are unevenly spread, the 
filter coefficients converge at different speeds: the smaller the kth 
eigenvalue the slower the speed of convergence of the kth coefficient. The 
filter coefficients with maximum and minimum eigenvalues, $\lambda_{\max}$ and $\lambda_{\min}$, 
converge according to the following equations: 

$$v_{\max}(m+1) = \left(1 - \mu\lambda_{\max}\right)v_{\max}(m) \tag{7.96}$$

$$v_{\min}(m+1) = \left(1 - \mu\lambda_{\min}\right)v_{\min}(m) \tag{7.97}$$

The ratio of the maximum to the minimum eigenvalue of a correlation 
matrix is called the eigenvalue spread of the correlation matrix: 

$$\text{eigenvalue spread} = \frac{\lambda_{\max}}{\lambda_{\min}} \tag{7.98}$$

Note that the spread in the speed of convergence of filter coefficients is 
proportional to the spread in eigenvalue of the autocorrelation matrix of the 
input signal. 
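
As a small worked example of Equations (7.93) and (7.98), the sketch below computes the eigenvalue spread of an assumed 2×2 correlation matrix and the fraction of the initial error left in each mode after a fixed number of iterations. The matrix, the step size and the helper name `mode_decay` are illustrative assumptions.

```python
import numpy as np

def mode_decay(R, mu, n_iter):
    """Eigenvalues of R and the remaining error factors (1 - mu*lambda_k)**n_iter."""
    eigvals = np.linalg.eigvalsh(R)              # returned in ascending order
    return eigvals, (1.0 - mu * eigvals) ** n_iter

R = np.array([[1.0, 0.9], [0.9, 1.0]])           # eigenvalues 0.1 and 1.9
eigvals, remaining = mode_decay(R, mu=0.5, n_iter=20)
spread = eigvals.max() / eigvals.min()           # Equation (7.98)
```

For this matrix the spread is 19: after 20 iterations the mode of the largest eigenvalue has essentially vanished, while the mode of the smallest eigenvalue still retains roughly a third of its initial error.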



7.5 The LMS Filter 



The steepest-descent method employs the gradient of the averaged squared 
error to search for the least square error filter coefficients. A 
computationally simpler version of the gradient search method is the least 
mean square (LMS) filter, in which the gradient of the mean square error is 
substituted with the gradient of the instantaneous squared error function. 
The LMS adaptation method is defined as 

$$\mathbf{w}(m+1) = \mathbf{w}(m) + \mu\left(-\frac{\partial\,e^2(m)}{\partial\,\mathbf{w}(m)}\right) \tag{7.99}$$

where the error signal e(m) is given by 

$$e(m) = x(m) - \mathbf{w}^{T}(m)\,\mathbf{y}(m) \tag{7.100}$$

The instantaneous gradient of the squared error can be re-expressed as 

$$\begin{aligned}
\frac{\partial\,e^2(m)}{\partial\,\mathbf{w}(m)} &= \frac{\partial}{\partial\,\mathbf{w}(m)}\left[x(m) - \mathbf{w}^{T}(m)\,\mathbf{y}(m)\right]^2 \\
&= -2\,\mathbf{y}(m)\left[x(m) - \mathbf{w}^{T}(m)\,\mathbf{y}(m)\right] \\
&= -2\,\mathbf{y}(m)\,e(m)
\end{aligned} \tag{7.101}$$

Substituting Equation (7.101) into the recursion update equation of the filter 
parameters, Equation (7.99), and absorbing the factor of 2 in the step size 
$\mu$, yields the LMS adaptation equation: 

$$\mathbf{w}(m+1) = \mathbf{w}(m) + \mu\left[\mathbf{y}(m)\,e(m)\right] \tag{7.102}$$

It can be seen that the filter update equation is very simple. The LMS filter is 
widely used in adaptive filter applications such as adaptive equalisation, 
echo cancellation etc. The main advantage of the LMS algorithm is its 
simplicity both in terms of the memory requirement and the computational 
complexity which is O(P), where P is the filter length. 

Leaky LMS Algorithm The stability and the adaptability of the recursive 
LMS adaptation Equation (7.102) can be improved by introducing a so-called 
leakage factor $\alpha$ as 

$$\mathbf{w}(m+1) = \alpha\,\mathbf{w}(m) + \mu\left[\mathbf{y}(m)\,e(m)\right] \tag{7.103}$$

Note that the feedback equation for the time update of the filter coefficients 
is essentially a recursive (infinite impulse response) system with input 
$\mu\,\mathbf{y}(m)\,e(m)$ and its poles at $\alpha$. When the parameter $\alpha<1$, the effect is to 
introduce more stability and accelerate the filter adaptation to the changes 
in input signal characteristics. 
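
Both updates fit in one short routine, since Equation (7.102) is Equation (7.103) with α = 1. The following is a minimal sketch; the function name `lms_filter`, the step size, the leakage value and the three-tap test system `h` are assumptions chosen for illustration.

```python
import numpy as np

def lms_filter(y, x, P, mu, alpha=1.0):
    """LMS (alpha = 1, Eq. 7.102) or leaky LMS (alpha < 1, Eq. 7.103)."""
    w = np.zeros(P)
    errors = np.zeros(len(y))
    for m in range(P - 1, len(y)):
        y_vec = y[m - P + 1:m + 1][::-1]    # [y(m), ..., y(m-P+1)]
        e = x[m] - w @ y_vec                # error signal, Equation (7.100)
        w = alpha * w + mu * y_vec * e      # order-P work per sample
        errors[m] = e
    return w, errors

# Hypothetical noise-free system identification example.
rng = np.random.default_rng(0)
h = np.array([0.5, -0.3, 0.2])
y = rng.standard_normal(5000)               # white, unit-variance input
x = np.convolve(y, h)[:len(y)]
w_lms, _ = lms_filter(y, x, P=3, mu=0.01)
w_leaky, _ = lms_filter(y, x, P=3, mu=0.01, alpha=0.999)
```

In this example the plain LMS estimate settles on `h`, whereas the leaky version settles on a slightly shrunken coefficient vector: the leakage trades a small bias for the extra stability described above.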

Steady-State Error: The optimal least mean square error (LSE), $E_{\min}$, is 
achieved when the filter coefficients approach the optimum value defined 
by the block least square error equation $\mathbf{w}_o = \mathbf{R}_{yy}^{-1}\mathbf{r}_{yx}$ derived in Chapter 6. 
The steepest-descent method employs the average gradient of the error 
surface for incremental updates of the filter coefficients towards the optimal 
value. Hence, when the filter coefficients reach the minimum point of the 
mean square error curve, the averaged gradient is zero and will remain zero 
so long as the error surface is stationary. In contrast, examination of the 
LMS equation shows that for applications in which the LSE is non-zero 
such as noise reduction, the incremental update term $\mu\,e(m)\,\mathbf{y}(m)$ would 
remain non-zero even when the optimal point is reached. Thus at 
convergence, the LMS filter will randomly vary about the LSE point, with 
the result that the LSE for the LMS will be in excess of the LSE for Wiener 
or steepest-descent methods. Note that at, or near, convergence, a gradual 
decrease in $\mu$ would decrease the excess LSE at the expense of some loss of 
adaptability to changes in the signal characteristics. 



7.6 Summary 

This chapter began with an introduction to Kalman filter theory. The 
Kalman filter was derived using the orthogonality principle: for the optimal 
filter, the innovation sequence must be an uncorrelated process and 
orthogonal to the past observations. Note that the same principle can also 
be used to derive the Wiener filter coefficients. Although, like the Wiener 
filter, the derivation of the Kalman filter is based on the least squared error 
criterion, the Kalman filter differs from the Wiener filter in two respects. 
First, the Kalman filter can be applied to non-stationary processes, and 
second, the Kalman theory employs a model of the signal generation 
process in the form of the state equation. This is an important advantage in 
the sense that the Kalman filter can be used to explicitly model the 
dynamics of the signal process. 

For many practical applications such as echo cancellation, channel 
equalisation, adaptive noise cancellation, time-delay estimation, etc., the 
RLS and LMS filters provide a suitable alternative to the Kalman filter. The 
RLS filter is a recursive implementation of the Wiener filter, and, for 
stationary processes, it should converge to the same solution as the Wiener 
filter. The main advantage of the LMS filter is the relative simplicity of the 
algorithm. However, for signals with a large spectral dynamic range, or 
equivalently a large eigenvalue spread, the LMS has an uneven and slow 
rate of convergence. If, in addition to having a large eigenvalue spread, a 
signal is also non-stationary (e.g. speech and audio signals), then the LMS 
can be an unsuitable adaptation method, and the RLS method, with its 
better convergence rate and less sensitivity to the eigenvalue spread, 
becomes a more attractive alternative. 

LINEAR PREDICTION MODELS 



8.1 Linear Prediction Coding 

8.2 Forward, Backward and Lattice Predictors 

8.3 Short-term and Long-Term Linear Predictors 

8.4 MAP Estimation of Predictor Coefficients 

8.5 Sub-Band Linear Prediction 

8.6 Signal Restoration Using Linear Prediction Models 

8.7 Summary 



Linear prediction modelling is used in a diverse range of applications, 
such as data forecasting, speech coding, video coding, speech 
recognition, model-based spectral analysis, model-based 
interpolation, signal restoration, and impulse/step event detection. In the 
statistical literature, linear prediction models are often referred to as 
autoregressive (AR) processes. In this chapter, we introduce the theory of 
linear prediction modelling and consider efficient methods for the 
computation of predictor coefficients. We study the forward, backward and 
lattice predictors, and consider various methods for the formulation and 
calculation of predictor coefficients, including the least square error and 
maximum a posteriori methods. For the modelling of signals with a 
quasi-periodic structure, such as voiced speech, an extended linear predictor that 
simultaneously utilizes the short and long-term correlation structures is 
introduced. We study sub-band linear predictors that are particularly useful 
for sub-band processing of noisy signals. Finally, the application of linear 
prediction in enhancement of noisy speech is considered. Further 
applications of linear prediction models in this book are in Chapter 11 on 
the interpolation of a sequence of lost samples, and in Chapters 12 and 13 
on the detection and removal of impulsive noise and transient noise pulses. 

Figure 8.1 The concentration or spread of power in frequency indicates the 
predictable or random character of a signal: (a) a predictable signal; 
(b) a random signal. 

8.1 Linear Prediction Coding 

The success with which a signal can be predicted from its past samples 
depends on the autocorrelation function, or equivalently the bandwidth and 
the power spectrum, of the signal. As illustrated in Figure 8.1, in the time 
domain, a predictable signal has a smooth and correlated fluctuation, and in 
the frequency domain, the energy of a predictable signal is concentrated in 
narrow band/s of frequencies. In contrast, the energy of an unpredictable 
signal, such as a white noise, is spread over a wide band of frequencies. 

For a signal to have a capacity to convey information it must have a 
degree of randomness. Most signals, such as speech, music and video 
signals, are partially predictable and partially random. These signals can be 
modelled as the output of a filter excited by an uncorrelated input. The 
random input models the unpredictable part of the signal, whereas the filter 
models the predictable structure of the signal. The aim of linear prediction is 
to model the mechanism that introduces the correlation in a signal. 

Linear prediction models are extensively used in speech processing, in 
low bit-rate speech coders, speech enhancement and speech recognition. 
Speech is generated by inhaling air and then exhaling it through the glottis 
and the vocal tract. The noise-like air, from the lung, is modulated and 
shaped by the vibrations of the glottal cords and the resonance of the vocal 
tract. Figure 8.2 illustrates a source-filter model of speech. The source 
models the lung, and emits a random input excitation signal which is filtered 
by a pitch filter. 

Figure 8.2 A source-filter model of speech production. 



The pitch filter models the vibrations of the glottal cords, and generates a 
sequence of quasi-periodic excitation pulses for voiced sounds as shown in 
Figure 8.2. The pitch filter model is also termed the “long-term predictor” 
since it models the correlation of each sample with the samples a pitch 
period away. The main source of correlation and power in speech is the 
vocal tract. The vocal tract is modelled by a linear predictor model, which is 
also termed the “short-term predictor”, because it models the correlation of 
each sample with the few preceding samples. In this section, we study the 
short-term linear prediction model. In Section 8.3, the predictor model is 
extended to include long-term pitch period correlations. 

A linear predictor model forecasts the amplitude of a signal at time m, 
x(m), using a linearly weighted combination of P past samples [x(m-1), 
x(m-2), ..., x(m-P)] as 

$$\hat{x}(m) = \sum_{k=1}^{P} a_k\,x(m-k) \tag{8.1}$$

where the integer variable m is the discrete time index, $\hat{x}(m)$ is the 
prediction of x(m), and $a_k$ are the predictor coefficients. A block-diagram 
implementation of the predictor of Equation (8.1) is illustrated in Figure 8.3. 

The prediction error e(m), defined as the difference between the actual 
sample value x(m) and its predicted value $\hat{x}(m)$, is given by 

$$\begin{aligned}
e(m) &= x(m) - \hat{x}(m) \\
&= x(m) - \sum_{k=1}^{P} a_k\,x(m-k)
\end{aligned} \tag{8.2}$$

Figure 8.3 Block-diagram illustration of a linear predictor. 

For information-bearing signals, the prediction error e(m) may be regarded 
as the information, or the innovation, content of the sample x(m). From 
Equation (8.2) a signal generated, or modelled, by a linear predictor can be 
described by the following feedback equation 

$$x(m) = \sum_{k=1}^{P} a_k\,x(m-k) + e(m) \tag{8.3}$$

Figure 8.4 illustrates a linear predictor model of a signal x(m). In this model, 
the random input excitation (i.e. the prediction error) is e(m)=Gu(m), where 
u(m) is a zero-mean, unit-variance random signal, and G, a gain term, is the 
square root of the variance of e(m): 

$$G = \left(\mathcal{E}[e^2(m)]\right)^{1/2} \tag{8.4}$$

where $\mathcal{E}[\cdot]$ is an averaging, or expectation, operator. 

Figure 8.4 Illustration of a signal generated by a linear predictive model. 

Taking the z-transform of Equation (8.3) shows that the linear prediction 
model is an all-pole digital filter with z-transfer function 

$$H(z) = \frac{X(z)}{E(z)} = \frac{1}{1 - \sum_{k=1}^{P} a_k\,z^{-k}} \tag{8.5}$$

Figure 8.5 The pole-zero position and frequency response of a linear predictor. 

In general, a linear predictor of order P has P/2 complex pole pairs, and can 
model up to P/2 resonances of the signal spectrum as illustrated in Figure 8.5. 
Spectral analysis using linear prediction models is discussed in Chapter 9. 
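
The feedback model of Equation (8.3) and its poles can be explored with a short sketch. The order-2 coefficient values, the zero initial state and the function name `synthesise_ar` are illustrative assumptions; the poles are found as the roots of the polynomial whose coefficients are [1, -a_1, ..., -a_P].

```python
import numpy as np

def synthesise_ar(a, G, n_samples, rng):
    """Generate x(m) = sum_k a_k x(m-k) + G u(m), Equation (8.3)."""
    P = len(a)
    x = np.zeros(n_samples + P)              # P leading zeros act as the initial state
    u = rng.standard_normal(n_samples)       # zero-mean, unit-variance excitation
    for m in range(n_samples):
        past = x[m:m + P][::-1]              # [x(m-1), ..., x(m-P)]
        x[m + P] = a @ past + G * u[m]
    return x[P:]

a = np.array([1.2, -0.81])                   # assumed order-2 predictor coefficients
poles = np.roots(np.concatenate(([1.0], -a)))    # complex-conjugate pole pair
x = synthesise_ar(a, G=1.0, n_samples=1000, rng=np.random.default_rng(0))
```

With these coefficients the single pole pair lies at radius 0.9, inside the unit circle, so the model is stable and produces one spectral resonance, in line with the P/2 rule for P = 2.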



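8.1.1 Least Mean Square Error Predictor 

The “best” predictor coefficients are normally obtained by minimising a 
mean square error criterion defined as 

$$\begin{aligned}
\mathcal{E}[e^2(m)] &= \mathcal{E}\left[\left(x(m) - \sum_{k=1}^{P} a_k\,x(m-k)\right)^2\right] \\
&= \mathcal{E}[x^2(m)] - 2\sum_{k=1}^{P} a_k\,\mathcal{E}[x(m)x(m-k)] + \sum_{k=1}^{P} a_k \sum_{j=1}^{P} a_j\,\mathcal{E}[x(m-k)x(m-j)] \\
&= r_{xx}(0) - 2\,\mathbf{r}_{xx}^{T}\mathbf{a} + \mathbf{a}^{T}\mathbf{R}_{xx}\mathbf{a}
\end{aligned} \tag{8.6}$$

where $\mathbf{R}_{xx} = \mathcal{E}[\mathbf{x}\mathbf{x}^{T}]$ is the autocorrelation matrix of the input vector 
$\mathbf{x}^{T}=[x(m-1), x(m-2), \ldots, x(m-P)]$, $\mathbf{r}_{xx}=\mathcal{E}[x(m)\mathbf{x}]$ is the autocorrelation 
vector and $\mathbf{a}^{T}=[a_1, a_2, \ldots, a_P]$ is the predictor coefficient vector. From 
Equation (8.6), the gradient of the mean square prediction error with respect 
to the predictor coefficient vector a is given by 

$$\frac{\partial}{\partial\mathbf{a}}\,\mathcal{E}[e^2(m)] = -2\,\mathbf{r}_{xx}^{T} + 2\,\mathbf{a}^{T}\mathbf{R}_{xx} \tag{8.7}$$

where the gradient vector is defined as 

$$\frac{\partial}{\partial\mathbf{a}} = \left(\frac{\partial}{\partial a_1}, \frac{\partial}{\partial a_2}, \ldots, \frac{\partial}{\partial a_P}\right) \tag{8.8}$$

The least mean square error solution, obtained by setting Equation (8.7) to 
zero, is given by 

$$\mathbf{R}_{xx}\,\mathbf{a} = \mathbf{r}_{xx} \tag{8.9}$$

From Equation (8.9) the predictor coefficient vector is given by 

$$\mathbf{a} = \mathbf{R}_{xx}^{-1}\,\mathbf{r}_{xx} \tag{8.10}$$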



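Equation (8.10) may also be written in an expanded form as 

$$\begin{pmatrix} a_1 \\ a_2 \\ a_3 \\ \vdots \\ a_P \end{pmatrix} =
\begin{pmatrix}
r_{xx}(0) & r_{xx}(1) & r_{xx}(2) & \cdots & r_{xx}(P-1) \\
r_{xx}(1) & r_{xx}(0) & r_{xx}(1) & \cdots & r_{xx}(P-2) \\
r_{xx}(2) & r_{xx}(1) & r_{xx}(0) & \cdots & r_{xx}(P-3) \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
r_{xx}(P-1) & r_{xx}(P-2) & r_{xx}(P-3) & \cdots & r_{xx}(0)
\end{pmatrix}^{-1}
\begin{pmatrix} r_{xx}(1) \\ r_{xx}(2) \\ r_{xx}(3) \\ \vdots \\ r_{xx}(P) \end{pmatrix} \tag{8.11}$$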



An alternative formulation of the least square error problem is as follows. 
For a signal block of N samples [x(0), ..., x(N-1)], we can write a set of N 
linear prediction error equations as 

$$\begin{pmatrix} e(0) \\ e(1) \\ e(2) \\ \vdots \\ e(N-1) \end{pmatrix} =
\begin{pmatrix} x(0) \\ x(1) \\ x(2) \\ \vdots \\ x(N-1) \end{pmatrix} -
\begin{pmatrix}
x(-1) & x(-2) & x(-3) & \cdots & x(-P) \\
x(0) & x(-1) & x(-2) & \cdots & x(1-P) \\
x(1) & x(0) & x(-1) & \cdots & x(2-P) \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
x(N-2) & x(N-3) & x(N-4) & \cdots & x(N-P-1)
\end{pmatrix}
\begin{pmatrix} a_1 \\ a_2 \\ a_3 \\ \vdots \\ a_P \end{pmatrix} \tag{8.12}$$

where $\mathbf{x}^{T} = [x(-1), \ldots, x(-P)]$ is the initial vector. In a compact vector/matrix 
notation Equation (8.12) can be written as 

$$\mathbf{e} = \mathbf{x} - \mathbf{X}\mathbf{a} \tag{8.13}$$



Using Equation (8.13), the sum of squared prediction errors over a block of 
N samples can be expressed as 

$$\mathbf{e}^{T}\mathbf{e} = \mathbf{x}^{T}\mathbf{x} - 2\,\mathbf{x}^{T}\mathbf{X}\mathbf{a} + \mathbf{a}^{T}\mathbf{X}^{T}\mathbf{X}\mathbf{a} \tag{8.14}$$

The least squared error predictor is obtained by setting the derivative of 
Equation (8.14) with respect to the parameter vector a to zero: 

$$\frac{\partial\,\mathbf{e}^{T}\mathbf{e}}{\partial\,\mathbf{a}} = -2\,\mathbf{x}^{T}\mathbf{X} + 2\,\mathbf{a}^{T}\mathbf{X}^{T}\mathbf{X} = 0 \tag{8.15}$$

From Equation (8.15), the least square error predictor is given by 

$$\mathbf{a} = \left(\mathbf{X}^{T}\mathbf{X}\right)^{-1}\left(\mathbf{X}^{T}\mathbf{x}\right) \tag{8.16}$$

A comparison of Equations (8.11) and (8.16) shows that in Equation (8.16) 
the autocorrelation matrix and vector of Equation (8.11) are replaced by the 
time-averaged estimates as 

$$\hat{r}_{xx}(m) = \frac{1}{N}\sum_{k=0}^{N-1} x(k)\,x(k-m) \tag{8.17}$$

Equations (8.11) and (8.16) may be solved efficiently by utilising the 
regular Toeplitz structure of the correlation matrix $\mathbf{R}_{xx}$. In a Toeplitz matrix, 
all the elements on a left-right diagonal are equal. The correlation matrix is 
also cross-diagonal symmetric. Note that altogether there are only P+1 
unique elements $[r_{xx}(0), r_{xx}(1), \ldots, r_{xx}(P)]$ in the correlation matrix and the 
cross-correlation vector. An efficient method for solution of Equation (8.10) 
is the Levinson-Durbin algorithm, introduced in Section 8.2.2. 
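
The two formulations can be compared on synthetic data. The sketch below estimates the coefficients of an assumed order-2 model both from the time-averaged autocorrelation of Equation (8.17) with Equation (8.9), and from the data-matrix solution of Equation (8.16). The function names, the test coefficients and the choices of treating samples before x(0) as zero (first method) or dropping the rows that would need them (second method) are assumptions of this example. A general-purpose solver is used here in place of the Levinson-Durbin recursion.

```python
import numpy as np

def autocorr(x, max_lag):
    """Time-averaged r_xx(m), Equation (8.17), with samples before x(0) taken as zero."""
    N = len(x)
    return np.array([x[m:] @ x[:N - m] / N for m in range(max_lag + 1)])

def lp_autocorrelation(x, P):
    """Solve R_xx a = r_xx, Equation (8.9), with a Toeplitz R_xx."""
    r = autocorr(x, P)
    R = np.array([[r[abs(i - j)] for j in range(P)] for i in range(P)])
    return np.linalg.solve(R, r[1:])

def lp_least_squares(x, P):
    """Least squares form of Equation (8.16), using only rows with known past samples."""
    X = np.column_stack([x[P - k:len(x) - k] for k in range(1, P + 1)])
    return np.linalg.lstsq(X, x[P:], rcond=None)[0]

# Synthetic test signal from an assumed order-2 model.
rng = np.random.default_rng(0)
a_true = np.array([1.2, -0.81])
u = rng.standard_normal(20000)
x = np.zeros(len(u))
for m in range(2, len(x)):
    x[m] = a_true @ x[m - 2:m][::-1] + u[m]

a_corr = lp_autocorrelation(x, 2)
a_ls = lp_least_squares(x, 2)
```

For a long block the two estimates are close to each other and to the generating coefficients; they differ mainly in how the block edges are handled.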



8.1.2 The Inverse Filter: Spectral Whitening 



The all-pole linear predictor model, in Figure 8.4, shapes the spectrum of 
the input signal by transforming an uncorrelated excitation signal u(m) to a 
correlated output signal x(m). In the frequency domain the input-output 
relation of the all-pole filter of Figure 8.4 is given by 

$$X(f) = \frac{G\,U(f)}{A(f)} = \frac{E(f)}{1 - \sum_{k=1}^{P} a_k\,e^{-j2\pi fk}} \tag{8.18}$$

where X(f), E(f) and U(f) are the spectra of x(m), e(m) and u(m) respectively, 
G is the input gain factor, and A(f) is the frequency response of the inverse 
predictor. As the excitation signal e(m) is assumed to have a flat spectrum, it 
follows that the shape of the signal spectrum X(f) is due to the frequency 
response 1/A(f) of the all-pole predictor model. The inverse linear predictor, 
as the name implies, transforms a correlated signal x(m) back to an 
uncorrelated flat-spectrum signal e(m). The inverse filter, also known as the 
prediction error filter, is an all-zero finite impulse response filter defined as 

$$\begin{aligned}
e(m) &= x(m) - \hat{x}(m) \\
&= x(m) - \sum_{k=1}^{P} a_k\,x(m-k) \\
&= \left(\mathbf{a}^{\mathrm{inv}}\right)^{T}\mathbf{x}
\end{aligned} \tag{8.19}$$

Figure 8.6 Illustration of the inverse (or whitening) filter. 

where the inverse filter $(\mathbf{a}^{\mathrm{inv}})^{T}=[1, -a_1, \ldots, -a_P]=[1, -\mathbf{a}^{T}]$, and $\mathbf{x}^{T}=[x(m), \ldots, 
x(m-P)]$. The z-transfer function of the inverse predictor model is given by 

$$A(z) = 1 - \sum_{k=1}^{P} a_k\,z^{-k} \tag{8.20}$$

A linear predictor model is an all-pole filter, where the poles model the 
resonance of the signal spectrum. The inverse of an all-pole filter is an 
all-zero filter, with the zeros situated at the same positions in the pole-zero plot 
as the poles of the all-pole filter, as illustrated in Figure 8.7. Consequently, 
the zeros of the inverse filter introduce anti-resonances that cancel out the 
resonances of the poles of the predictor. The inverse filter has the effect of 
flattening the spectrum of the input signal, and is also known as a spectral 
whitening, or decorrelation, filter. 
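
The cancellation of the predictor poles by the zeros of the inverse filter can be checked directly: passing an all-pole model output through the inverse filter of Equation (8.19) should return the excitation. In this sketch the order-2 coefficients, the zero initial state and the function name `inverse_filter` are illustrative assumptions.

```python
import numpy as np

def inverse_filter(x, a):
    """Prediction error e(m) = x(m) - sum_k a_k x(m-k), Equation (8.19)."""
    a_inv = np.concatenate(([1.0], -np.asarray(a)))   # [1, -a_1, ..., -a_P]
    return np.convolve(x, a_inv)[:len(x)]             # samples before x(0) taken as zero

# Build x(m) from a known excitation with the all-pole model of Equation (8.3).
rng = np.random.default_rng(0)
a = np.array([1.2, -0.81])
e_true = rng.standard_normal(2000)
x = np.zeros(len(e_true))
for m in range(len(x)):
    for k, a_k in enumerate(a, start=1):
        if m - k >= 0:
            x[m] += a_k * x[m - k]
    x[m] += e_true[m]

e_hat = inverse_filter(x, a)     # recovers the excitation up to rounding error
```

Here the exact coefficients are known; in practice they are estimated from the signal, and the residual is only approximately white.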





Figure 8.7 Illustration of the pole-zero diagram, and the frequency responses of an 
all-pole predictor and its all-zero inverse filter. 

8.1.3 The Prediction Error Signal 

The prediction error signal is in general composed of three components: 

(a) the input signal, also called the excitation signal; 

(b) the errors due to the modelling inaccuracies; 

(c) the noise. 

The mean square prediction error becomes zero only if the following 
three conditions are satisfied: (a) the signal is deterministic, (b) the signal is 
correctly modelled by a predictor of order P, and (c) the signal is noise-free. 
For example, a mixture of P/2 sine waves can be modelled by a predictor of 
order P, with zero prediction error. However, in practice, the prediction 
error is nonzero because information bearing signals are random, often only 
approximately modelled by a linear system, and usually observed in noise. 
The least mean square prediction error, obtained from substitution of 
Equation (8.9) in Equation (8.6), is 

$$E^{(P)} = \mathcal{E}[e^2(m)] = r_{xx}(0) - \sum_{k=1}^{P} a_k\,r_{xx}(k) \tag{8.21}$$

where $E^{(P)}$ denotes the prediction error for a predictor of order P. The 
prediction error decreases, initially rapidly and then slowly, with increasing 
predictor order up to the correct model order. For the correct model order, 
the signal e(m) is an uncorrelated zero-mean random process with an 
autocorrelation function defined as 

$$\mathcal{E}[e(m)\,e(m-k)] = \begin{cases} \sigma_e^2 = G^2 & \text{if } k = 0 \\ 0 & \text{if } k \neq 0 \end{cases} \tag{8.22}$$

where $\sigma_e^2$ is the variance of e(m). 
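
The behaviour of the prediction error with model order can be illustrated with a short sketch based on Equations (8.9) and (8.21). The synthetic order-2 signal, the estimated autocorrelation and the function names are assumptions made for the example.

```python
import numpy as np

def autocorr(x, max_lag):
    """Time-averaged autocorrelation estimate, samples before x(0) taken as zero."""
    N = len(x)
    return np.array([x[m:] @ x[:N - m] / N for m in range(max_lag + 1)])

def prediction_error_by_order(r, max_order):
    """E^(P) = r_xx(0) - sum_k a_k r_xx(k), Equation (8.21), for P = 0..max_order."""
    errors = [r[0]]                                   # order 0: no prediction
    for P in range(1, max_order + 1):
        R = np.array([[r[abs(i - j)] for j in range(P)] for i in range(P)])
        a = np.linalg.solve(R, r[1:P + 1])            # Equation (8.9)
        errors.append(r[0] - a @ r[1:P + 1])
    return np.array(errors)

# Synthetic signal from an assumed order-2 model with unit-variance excitation.
rng = np.random.default_rng(0)
a_true = np.array([1.2, -0.81])
u = rng.standard_normal(20000)
x = np.zeros(len(u))
for m in range(2, len(x)):
    x[m] = a_true @ x[m - 2:m][::-1] + u[m]

errs = prediction_error_by_order(autocorr(x, 4), max_order=4)
```

The error drops sharply up to order 2, where it approaches the excitation variance, and changes very little for higher orders, which is the flattening described above.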



8.2 Forward, Backward and Lattice Predictors 

The forward predictor model of Equation (8.1) predicts a sample x(m) from 
a linear combination of P past samples x(m-1), x(m-2), ..., x(m-P). 

Figure 8.8 Illustration of forward and backward predictors. 

Similarly, as shown in Figure 8.8, we can define a backward predictor, that 
predicts a sample x(m-P) from P future samples x(m-P+1), ..., x(m) as 

$$\hat{x}(m-P) = \sum_{k=1}^{P} c_k\,x(m-k+1) \tag{8.23}$$

The backward prediction error is defined as the difference between the 
actual sample and its predicted value: 

$$\begin{aligned}
b(m) &= x(m-P) - \hat{x}(m-P) \\
&= x(m-P) - \sum_{k=1}^{P} c_k\,x(m-k+1)
\end{aligned} \tag{8.24}$$

From Equation (8.24), a signal generated by a backward predictor is given 
by 

$$x(m-P) = \sum_{k=1}^{P} c_k\,x(m-k+1) + b(m) \tag{8.25}$$

The coefficients of the least square error backward predictor, obtained in a 
similar method to that of the forward predictor in Section 8.1.1, are given by 

$$\begin{pmatrix}
r_{xx}(0) & r_{xx}(1) & r_{xx}(2) & \cdots & r_{xx}(P-1) \\
r_{xx}(1) & r_{xx}(0) & r_{xx}(1) & \cdots & r_{xx}(P-2) \\
r_{xx}(2) & r_{xx}(1) & r_{xx}(0) & \cdots & r_{xx}(P-3) \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
r_{xx}(P-1) & r_{xx}(P-2) & r_{xx}(P-3) & \cdots & r_{xx}(0)
\end{pmatrix}
\begin{pmatrix} c_1 \\ c_2 \\ c_3 \\ \vdots \\ c_P \end{pmatrix} =
\begin{pmatrix} r_{xx}(P) \\ r_{xx}(P-1) \\ r_{xx}(P-2) \\ \vdots \\ r_{xx}(1) \end{pmatrix} \tag{8.26}$$



Note that the main difference between Equations (8.26) and (8.11) is that the
correlation vector on the right-hand side of the backward predictor, Equation
(8.26), is upside-down compared with that of the forward predictor, Equation
(8.11). Since the correlation matrix is Toeplitz and symmetric, Equation
(8.11) for the forward predictor may be rearranged and rewritten in the
following form:

$$\begin{bmatrix} r_{xx}(0) & r_{xx}(1) & r_{xx}(2) & \cdots & r_{xx}(P-1) \\ r_{xx}(1) & r_{xx}(0) & r_{xx}(1) & \cdots & r_{xx}(P-2) \\ r_{xx}(2) & r_{xx}(1) & r_{xx}(0) & \cdots & r_{xx}(P-3) \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ r_{xx}(P-1) & r_{xx}(P-2) & r_{xx}(P-3) & \cdots & r_{xx}(0) \end{bmatrix} \begin{bmatrix} a_P \\ a_{P-1} \\ a_{P-2} \\ \vdots \\ a_1 \end{bmatrix} = \begin{bmatrix} r_{xx}(P) \\ r_{xx}(P-1) \\ r_{xx}(P-2) \\ \vdots \\ r_{xx}(1) \end{bmatrix} \tag{8.27}$$

A comparison of Equations (8.27) and (8.26) shows that the coefficients of
the backward predictor are the time-reversed versions of those of the
forward predictor:

$$c = [c_1, c_2, \ldots, c_P]^{T} = [a_P, a_{P-1}, \ldots, a_1]^{T} = a^{B} \tag{8.28}$$



where the vector $a^{B}$ is the reversed version of the vector a. The relation
between the backward and forward predictors is employed in the Levinson-Durbin
algorithm to derive an efficient method for calculation of the
predictor coefficients, as described in Section 8.2.2.
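
The time-reversal relation of Equation (8.28) can be checked numerically. The following Python sketch solves the forward equations (8.11) and the backward equations (8.26) for the same autocorrelation values and compares the two solutions. The helper `solve` and the autocorrelation values are illustrative assumptions, not taken from the text; the values correspond to a hypothetical second-order autoregressive process.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(A[i]) + [b[i]] for i in range(n)]      # augmented matrix [A | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda row: abs(M[row][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for row in range(col + 1, n):
            factor = M[row][col] / M[col][col]
            for j in range(col, n + 1):
                M[row][j] -= factor * M[col][j]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):                   # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


# Assumed autocorrelation values r(0), ..., r(3) of a hypothetical AR(2)
# process x(m) = 0.75 x(m-1) - 0.5 x(m-2) + e(m), normalised so that r(0) = 1.
r = [1.0, 0.5, -0.125, -0.34375]
P = 3

# Toeplitz correlation matrix with elements r(|i - j|).
R = [[r[abs(i - j)] for j in range(P)] for i in range(P)]

a = solve(R, r[1:P + 1])     # forward predictor:  R a = [r(1), ..., r(P)]
c = solve(R, r[P:0:-1])      # backward predictor: R c = [r(P), ..., r(1)]

print("forward  a:", a)
print("backward c:", c)
```

Within rounding error the list `c` should be the list `a` read in reverse order, which is the content of Equation (8.28).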

8.2.1 Augmented Equations for Forward and Backward Predictors



The inverse forward predictor coefficient vector is $[1, -a_1, \ldots, -a_P] = [1, -a^{T}]$.
Equations (8.11) and (8.21) may be combined to yield a matrix equation for
the inverse forward predictor coefficients:

$$\begin{bmatrix} r_{xx}(0) & r_{xx}^{T} \\ r_{xx} & R_{xx} \end{bmatrix} \begin{bmatrix} 1 \\ -a \end{bmatrix} = \begin{bmatrix} E^{(P)} \\ 0 \end{bmatrix} \tag{8.29}$$

Equation (8.29) is called the augmented forward predictor equation.
Similarly, for the inverse backward predictor, we can define an augmented
backward predictor equation as

$$\begin{bmatrix} R_{xx} & r_{xx}^{B} \\ r_{xx}^{BT} & r_{xx}(0) \end{bmatrix} \begin{bmatrix} -a^{B} \\ 1 \end{bmatrix} = \begin{bmatrix} 0 \\ E^{(P)} \end{bmatrix} \tag{8.30}$$

where $r_{xx}^{T} = [r_{xx}(1), \ldots, r_{xx}(P)]$ and $r_{xx}^{BT} = [r_{xx}(P), \ldots, r_{xx}(1)]$. Note that the
superscript BT denotes backward and transposed. The augmented forward
and backward matrix Equations (8.29) and (8.30) are used to derive an
order-update solution for the linear predictor coefficients as follows.



8.2.2 Levinson-Durbin Recursive Solution 

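The Levinson-Durbin algorithm is a recursive order-update method for
calculation of linear predictor coefficients. A forward-prediction error filter
of order i can be described in terms of the forward and backward prediction
error filters of order i-1 as

$$\begin{bmatrix} 1 \\ -a_1^{(i)} \\ \vdots \\ -a_{i-1}^{(i)} \\ -a_i^{(i)} \end{bmatrix} = \begin{bmatrix} 1 \\ -a_1^{(i-1)} \\ \vdots \\ -a_{i-1}^{(i-1)} \\ 0 \end{bmatrix} - k_i \begin{bmatrix} 0 \\ -a_{i-1}^{(i-1)} \\ \vdots \\ -a_1^{(i-1)} \\ 1 \end{bmatrix} \tag{8.31}$$

or in a more compact vector notation as

$$\begin{bmatrix} 1 \\ -a^{(i)} \end{bmatrix} = \begin{bmatrix} 1 \\ -a^{(i-1)} \\ 0 \end{bmatrix} - k_i \begin{bmatrix} 0 \\ -a^{(i-1)B} \\ 1 \end{bmatrix} \tag{8.32}$$

where $k_i$ is called the reflection coefficient. The proof of Equation (8.32) and
the derivation of the value of the reflection coefficient $k_i$ follow shortly.
Similarly, a backward prediction error filter of order i is described in terms
of the forward and backward prediction error filters of order i-1 as

$$\begin{bmatrix} -a^{(i)B} \\ 1 \end{bmatrix} = \begin{bmatrix} 0 \\ -a^{(i-1)B} \\ 1 \end{bmatrix} - k_i \begin{bmatrix} 1 \\ -a^{(i-1)} \\ 0 \end{bmatrix} \tag{8.33}$$

To prove the order-update Equation (8.32) (or alternatively Equation
(8.33)), we multiply both sides of the equation by the (i+1)x(i+1)
augmented matrix $R_{xx}^{(i+1)}$ and use the equality

$$R_{xx}^{(i+1)} = \begin{bmatrix} R_{xx}^{(i)} & r_{xx}^{(i)B} \\ r_{xx}^{(i)BT} & r_{xx}(0) \end{bmatrix} = \begin{bmatrix} r_{xx}(0) & r_{xx}^{(i)T} \\ r_{xx}^{(i)} & R_{xx}^{(i)} \end{bmatrix} \tag{8.34}$$

to obtain

$$R_{xx}^{(i+1)} \begin{bmatrix} 1 \\ -a^{(i)} \end{bmatrix} = \begin{bmatrix} R_{xx}^{(i)} & r_{xx}^{(i)B} \\ r_{xx}^{(i)BT} & r_{xx}(0) \end{bmatrix} \begin{bmatrix} 1 \\ -a^{(i-1)} \\ 0 \end{bmatrix} - k_i \begin{bmatrix} r_{xx}(0) & r_{xx}^{(i)T} \\ r_{xx}^{(i)} & R_{xx}^{(i)} \end{bmatrix} \begin{bmatrix} 0 \\ -a^{(i-1)B} \\ 1 \end{bmatrix} \tag{8.35}$$

where in Equations (8.34) and (8.35) $r_{xx}^{(i)} = [r_{xx}(1), \ldots, r_{xx}(i)]^{T}$, and
$r_{xx}^{(i)B} = [r_{xx}(i), \ldots, r_{xx}(1)]^{T}$ is the reversed version of $r_{xx}^{(i)}$. Matrix-vector
multiplication of both sides of Equation (8.35) and the use of Equations
(8.29) and (8.30) yields

$$\begin{bmatrix} E^{(i)} \\ 0^{(i)} \end{bmatrix} = \begin{bmatrix} E^{(i-1)} \\ 0^{(i-1)} \\ \Delta^{(i-1)} \end{bmatrix} - k_i \begin{bmatrix} \Delta^{(i-1)} \\ 0^{(i-1)} \\ E^{(i-1)} \end{bmatrix} \tag{8.36}$$

where $0^{(i)}$ denotes a zero vector of length i and

$$\Delta^{(i-1)} = \begin{bmatrix} 1 & -a^{(i-1)T} \end{bmatrix} r_{xx}^{(i)B} = r_{xx}(i) - \sum_{k=1}^{i-1} a_k^{(i-1)}\, r_{xx}(i-k) \tag{8.37}$$

If Equation (8.36) is true, it follows that Equation (8.32) must also be true.
The conditions for Equation (8.36) to be true are

$$E^{(i)} = E^{(i-1)} - k_i\, \Delta^{(i-1)} \tag{8.38}$$

and

$$0 = \Delta^{(i-1)} - k_i\, E^{(i-1)} \tag{8.39}$$

From (8.39),

$$k_i = \frac{\Delta^{(i-1)}}{E^{(i-1)}} \tag{8.40}$$

Substitution of $\Delta^{(i-1)}$ from Equation (8.40) into Equation (8.38) yields

$$E^{(i)} = E^{(i-1)} (1 - k_i^2) = E^{(0)} \prod_{j=1}^{i} (1 - k_j^2) \tag{8.41}$$

Note that it can be shown that $\Delta^{(i-1)}$ is the cross-correlation of the forward and
backward prediction errors:

$$\Delta^{(i-1)} = E[\, b^{(i-1)}(m-1)\, e^{(i-1)}(m) \,] \tag{8.42}$$

The parameter $\Delta^{(i-1)}$ is known as the partial correlation.

Durbin's algorithm

Equations (8.43)-(8.48) are solved recursively for i = 1, ..., P. The Durbin
algorithm starts with a predictor of order zero for which $E^{(0)} = r_{xx}(0)$. The
algorithm then computes the coefficients of a predictor of order i, using the
coefficients of a predictor of order i-1. In the process of solving for the
coefficients of a predictor of order P, the solutions for the predictor
coefficients of all orders less than P are also obtained:

$$E^{(0)} = r_{xx}(0) \tag{8.43}$$

For i = 1, ..., P

$$\Delta^{(i-1)} = r_{xx}(i) - \sum_{k=1}^{i-1} a_k^{(i-1)}\, r_{xx}(i-k) \tag{8.44}$$

$$k_i = \frac{\Delta^{(i-1)}}{E^{(i-1)}} \tag{8.45}$$

$$a_i^{(i)} = k_i \tag{8.46}$$

$$a_j^{(i)} = a_j^{(i-1)} - k_i\, a_{i-j}^{(i-1)}, \qquad 1 \le j \le i-1 \tag{8.47}$$

$$E^{(i)} = (1 - k_i^2)\, E^{(i-1)} \tag{8.48}$$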
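
The recursion can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `levinson_durbin` and the test autocorrelation values are assumptions, and the sign convention is the one in which $k_i = \Delta^{(i-1)}/E^{(i-1)}$ and $a_i^{(i)} = k_i$, which matches the lattice recursions of Equations (8.49) and (8.50).

```python
def levinson_durbin(r, order):
    """Order-recursive solution of the normal equations.

    r:     autocorrelation values r[0], ..., r[order].
    Returns (a, k, E): predictor coefficients a_1..a_P, reflection
    coefficients k_1..k_P and error powers E^(0)..E^(P).
    """
    a = []                       # coefficients of the order-(i-1) predictor
    k = []
    E = [r[0]]                   # E^(0) = r(0)
    for i in range(1, order + 1):
        # Partial correlation: r(i) - sum_{j=1}^{i-1} a_j^(i-1) r(i-j)
        delta = r[i] - sum(a[j - 1] * r[i - j] for j in range(1, i))
        ki = delta / E[-1]
        # a_j^(i) = a_j^(i-1) - k_i a_{i-j}^(i-1) for j < i, and a_i^(i) = k_i
        a = [a[j - 1] - ki * a[i - j - 1] for j in range(1, i)] + [ki]
        k.append(ki)
        E.append(E[-1] * (1.0 - ki * ki))
    return a, k, E


# Assumed autocorrelation values of a hypothetical AR(2) process
# x(m) = 0.75 x(m-1) - 0.5 x(m-2) + e(m), normalised so that r(0) = 1.
r = [1.0, 0.5, -0.125]
a, k, E = levinson_durbin(r, 2)
print("a =", a)
print("k =", k)
print("E =", E)
```

Each pass of the loop costs a number of operations proportional to i, so the full solution needs on the order of P^2 operations, compared with the order of P^3 for a general matrix inversion.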

8.2.3 Lattice Predictors

The lattice structure, shown in Figure 8.9, is a cascade connection of similar
units, with each unit specified by a single parameter $k_i$, known as the
reflection coefficient. A major attraction of a lattice structure is its modular
form and the relative ease with which the model order can be extended. A
further advantage is that, for a stable model, the magnitude of $k_i$ is bounded
by unity ($|k_i| < 1$), and therefore it is relatively easy to check a lattice
structure for stability. The lattice structure is derived from the forward and
backward prediction errors as follows. An order-update recursive equation
can be obtained for the forward prediction error by multiplying both sides of
Equation (8.32) by the input vector [x(m), x(m-1), ..., x(m-i)]:

$$e^{(i)}(m) = e^{(i-1)}(m) - k_i\, b^{(i-1)}(m-1) \tag{8.49}$$

Similarly, we can obtain an order-update recursive equation for the
backward prediction error by multiplying both sides of Equation (8.33) by
the same input vector [x(m), x(m-1), ..., x(m-i)] as

$$b^{(i)}(m) = b^{(i-1)}(m-1) - k_i\, e^{(i-1)}(m) \tag{8.50}$$

Equations (8.49) and (8.50) are interrelated and may be implemented by a
lattice network as shown in Figure 8.9. Minimisation of the squared forward
prediction error of Equation (8.49) over N samples yields

$$k_i = \frac{\sum_{m=0}^{N-1} e^{(i-1)}(m)\, b^{(i-1)}(m-1)}{\sum_{m=0}^{N-1} \left[ b^{(i-1)}(m-1) \right]^2} \tag{8.51}$$

Figure 8.9 Configuration of (a) a lattice predictor and (b) the inverse lattice
predictor.

Note that a similar relation for $k_i$ can be obtained through minimisation of
the squared backward prediction error of Equation (8.50) over N samples.
The reflection coefficients are also known as the normalised partial
correlation (PARCOR) coefficients.
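
The order-update recursions (8.49) and (8.50) translate directly into a cascade of identical stages. The Python sketch below is an illustration under stated assumptions: the function name `lattice_errors` is hypothetical, the initial state is taken as zero (samples before m = 0 are treated as zero), and both error sequences are initialised to the input, e^(0)(m) = b^(0)(m) = x(m).

```python
def lattice_errors(x, k):
    """Forward and backward prediction errors at the output of a lattice filter.

    x: input samples x(0), ..., x(N-1).
    k: reflection coefficients k_1, ..., k_P.
    """
    e = list(x)                  # e^(0)(m) = x(m)
    b = list(x)                  # b^(0)(m) = x(m)
    for ki in k:
        b_delayed = [0.0] + b[:-1]                     # b^(i-1)(m-1)
        e_new = [ei - ki * bd for ei, bd in zip(e, b_delayed)]   # (8.49)
        b_new = [bd - ki * ei for ei, bd in zip(e, b_delayed)]   # (8.50)
        e, b = e_new, b_new
    return e, b


x = [1.0, 2.0, -1.0, 0.5, 3.0]
k = [0.5, -0.5]                  # corresponds to a_1 = 0.75, a_2 = -0.5
e, b = lattice_errors(x, k)
print("forward error :", e)
print("backward error:", b)
```

For these reflection coefficients the forward error equals the direct-form prediction error x(m) - 0.75 x(m-1) + 0.5 x(m-2), which shows that the lattice and direct forms describe the same filter.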



8.2.4 Alternative Formulations of Least Square Error Prediction 

The methods described above for derivation of the predictor coefficients are 
based on minimisation of either the forward or the backward prediction 
error. In this section, we consider alternative methods based on the 
minimisation of the sum of the forward and backward prediction errors. 



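Burg's Method Burg's method is based on minimisation of the sum of the
forward and backward squared prediction errors. The squared error function
is defined as

$$E_{fb}^{(i)} = \sum_{m=0}^{N-1} \left\{ \left[ e^{(i)}(m) \right]^2 + \left[ b^{(i)}(m) \right]^2 \right\} \tag{8.52}$$

Substitution of Equations (8.49) and (8.50) in Equation (8.52) yields

$$E_{fb}^{(i)} = \sum_{m=0}^{N-1} \left\{ \left[ e^{(i-1)}(m) - k_i\, b^{(i-1)}(m-1) \right]^2 + \left[ b^{(i-1)}(m-1) - k_i\, e^{(i-1)}(m) \right]^2 \right\} \tag{8.53}$$

Minimisation of $E_{fb}^{(i)}$ with respect to the reflection coefficient $k_i$ yields

$$k_i = \frac{2 \sum_{m=0}^{N-1} e^{(i-1)}(m)\, b^{(i-1)}(m-1)}{\sum_{m=0}^{N-1} \left\{ \left[ e^{(i-1)}(m) \right]^2 + \left[ b^{(i-1)}(m-1) \right]^2 \right\}} \tag{8.54}$$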
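
A minimal Python sketch of the recursion implied by Equations (8.49), (8.50) and (8.54) follows. The function name `burg_reflection_coefficients` is hypothetical, and the summation range is an assumption: at each stage only those samples for which both e^(i-1)(m) and b^(i-1)(m-1) are available from the data are used, so the error sequences shrink by one sample per stage.

```python
def burg_reflection_coefficients(x, order):
    """Estimate reflection coefficients k_1..k_P with Burg's method."""
    e = list(x)                  # forward errors  e^(0)(m) = x(m)
    b = list(x)                  # backward errors b^(0)(m) = x(m)
    k = []
    for _ in range(order):
        ef = e[1:]               # e^(i-1)(m)
        bd = b[:-1]              # b^(i-1)(m-1), aligned with ef
        num = 2.0 * sum(f * g for f, g in zip(ef, bd))
        den = sum(f * f + g * g for f, g in zip(ef, bd))
        ki = num / den           # Equation (8.54)
        e = [f - ki * g for f, g in zip(ef, bd)]       # (8.49)
        b = [g - ki * f for f, g in zip(ef, bd)]       # (8.50)
        k.append(ki)
    return k


# A decaying exponential x(m) = 0.5^m as a simple test signal.
x = [0.5 ** m for m in range(8)]
print(burg_reflection_coefficients(x, 1))
```

Because 2|fg| never exceeds f^2 + g^2, every estimate produced by Equation (8.54) satisfies |k_i| <= 1, which is one reason the method is popular for short data records.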
Simultaneous Minimisation of the Backward and Forward
Prediction Errors From Equation (8.28), the backward
predictor coefficient vector is the reversed version of the forward predictor
coefficient vector. Hence a predictor of order P can be obtained through
simultaneous minimisation of the sum of the squared backward and forward
prediction errors defined by the following equation:

$$\begin{aligned} E_{fb}^{(P)} &= \sum_{m=0}^{N-1} \left\{ \left[ e^{(P)}(m) \right]^2 + \left[ b^{(P)}(m) \right]^2 \right\} \\ &= \sum_{m=0}^{N-1} \left\{ \left[ x(m) - \sum_{k=1}^{P} a_k\, x(m-k) \right]^2 + \left[ x(m-P) - \sum_{k=1}^{P} a_k\, x(m-P+k) \right]^2 \right\} \\ &= (x - Xa)^{T} (x - Xa) + (x^{B} - X^{B} a)^{T} (x^{B} - X^{B} a) \end{aligned} \tag{8.55}$$

where X and x are the signal matrix and vector defined by Equations (8.12)
and (8.13), and similarly $X^{B}$ and $x^{B}$ are the signal matrix and vector for the
backward predictor. Using an approach similar to that used in the derivation of
Equation (8.16), the minimisation of the squared error function of
Equation (8.55) yields

$$a = \left( X^{T} X + X^{BT} X^{B} \right)^{-1} \left( X^{T} x + X^{BT} x^{B} \right) \tag{8.56}$$

Note that for an ergodic signal, as the signal length N increases, Equation
(8.56) converges to the so-called normal Equation (8.10).
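
The following Python sketch builds the normal equations of Equation (8.56) by stacking the forward and the backward prediction equations and solving for a single coefficient vector. It is an illustration under assumptions: the names `solve` and `forward_backward_lp` are hypothetical, and only samples inside the observed record are used (no zero padding), which may differ from the ranges in Equations (8.12) and (8.13).

```python
import math


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda row: abs(M[row][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for row in range(col + 1, n):
            factor = M[row][col] / M[col][col]
            for j in range(col, n + 1):
                M[row][j] -= factor * M[col][j]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def forward_backward_lp(x, P):
    """Predictor minimising the sum of forward and backward squared errors."""
    rows, targets = [], []
    for m in range(P, len(x)):
        # Forward equation: x(m) predicted from x(m-1), ..., x(m-P).
        rows.append([x[m - k] for k in range(1, P + 1)])
        targets.append(x[m])
        # Backward equation: x(m-P) predicted from x(m-P+1), ..., x(m).
        rows.append([x[m - P + k] for k in range(1, P + 1)])
        targets.append(x[m - P])
    # (X^T X + X^BT X^B) a = X^T x + X^BT x^B
    G = [[sum(row[i] * row[j] for row in rows) for j in range(P)] for i in range(P)]
    g = [sum(row[i] * t for row, t in zip(rows, targets)) for i in range(P)]
    return solve(G, g)


# A noise-free sinusoid obeys x(m) = 2 cos(w) x(m-1) - x(m-2) in both directions.
w = 0.3
x = [math.cos(w * m) for m in range(50)]
a = forward_backward_lp(x, 2)
print(a)
```

For the sinusoid the solution should be close to [2 cos(0.3), -1], because that coefficient pair drives both the forward and the backward errors to zero.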



8.2.5 Predictor Model Order Selection 

One procedure for the determination of the correct model order is to
increment the model order, and monitor the differential change in the error
power, until the change levels off. The incremental change in error power
with the increasing model order from i-1 to i is defined as

$$\Delta E^{(i)} = E^{(i-1)} - E^{(i)} \tag{8.57}$$

Figure 8.10 Illustration of the decrease in the normalised mean squared
prediction error with the increasing predictor length for a speech signal.

Figure 8.10 illustrates the decrease in the normalised mean square prediction
error with the increasing predictor length for a speech signal. The order P
beyond which the decrease in the error power $\Delta E^{(P)}$ becomes less than a
threshold is taken as the model order.

In linear prediction two coefficients are required for modelling each 
spectral peak of the signal spectrum. For example, the modelling of a signal 
with K dominant resonances in the spectrum needs P=2K coefficients. 
Hence a procedure for model selection is to examine the power spectrum of 
the signal process, and to set the model order to twice the number of 
significant spectral peaks in the spectrum. 

When the model order is less than the correct order, the signal is under- 
modelled. In this case the prediction error is not well decorrelated and will 
be more than the optimal minimum. A further consequence of under- 
modelling is a decrease in the spectral resolution of the model: adjacent 
spectral peaks of the signal could be merged and appear as a single spectral 
peak when the model order is too small. When the model order is larger than 
the correct order, the signal is over-modelled. An over-modelled problem 
can result in an ill-conditioned matrix equation, unreliable numerical 
solutions and the appearance of spurious spectral peaks in the model. 
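
The threshold rule based on Equation (8.57) can be sketched as follows. The function name `select_order`, the normalisation of the error decrease by E^(0) and the threshold value are assumptions made for illustration; the error powers would typically come from the Levinson-Durbin recursion.

```python
def select_order(E, threshold):
    """Pick the model order from the error powers E^(0), ..., E^(Pmax).

    Returns the smallest order i such that going from order i to i+1
    reduces the normalised error power by less than `threshold`.
    """
    for i in range(len(E) - 1):
        delta = (E[i] - E[i + 1]) / E[0]     # normalised form of (8.57)
        if delta < threshold:
            return i
    return len(E) - 1                         # no levelling off was observed


# Error powers of a hypothetical process that is exactly second order.
E = [1.0, 0.75, 0.5625, 0.5625, 0.5625]
print(select_order(E, threshold=0.01))
```

With these error powers the decrease stops after order 2, so the rule returns 2.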

8.3 Short-Term and Long-Term Predictors

For quasi-periodic signals, such as voiced speech, there are two types of
correlation structures that can be utilised for a more accurate prediction:

(a) the short-term correlation, which is the correlation of each sample
with the P immediate past samples: x(m-1), ..., x(m-P);

(b) the long-term correlation, which is the correlation of a sample x(m)
with, say, 2Q+1 similar samples a pitch period T away: x(m-T+Q), ...,
x(m-T-Q).

Figure 8.11 Illustration of the short-term relation of a sample with the P immediate
past samples and the long-term relation with the samples a pitch period away.

Figure 8.11 is an illustration of the short-term relation of a sample with the
P immediate past samples and its long-term relation with the samples a
pitch period away. The short-term correlation of a signal may be modelled
by the linear prediction Equation (8.3). The remaining correlation, in the
prediction error signal e(m), is called the long-term correlation. The long-term
correlation in the prediction error signal may be modelled by a pitch
predictor defined as

$$\hat{e}(m) = \sum_{k=-Q}^{Q} p_k\, e(m-T-k) \tag{8.58}$$

where $p_k$ are the coefficients of a long-term predictor of order 2Q+1. The
pitch period T can be obtained from the autocorrelation function of x(m) or
that of e(m): it is the first non-zero time lag where the autocorrelation
function attains a maximum. Assuming that the long-term correlation is
correctly modelled, the prediction error of the long-term filter is a
completely random signal with a white spectrum, and is given by

$$\varepsilon(m) = e(m) - \hat{e}(m) = e(m) - \sum_{k=-Q}^{Q} p_k\, e(m-T-k) \tag{8.59}$$



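Minimisation of $E[\varepsilon^2(m)]$ results in the following solution for the pitch
predictor:

$$\begin{bmatrix} p_{-Q} \\ p_{-Q+1} \\ \vdots \\ p_{Q} \end{bmatrix} = \begin{bmatrix} r_{ee}(0) & r_{ee}(1) & \cdots & r_{ee}(2Q) \\ r_{ee}(1) & r_{ee}(0) & \cdots & r_{ee}(2Q-1) \\ \vdots & \vdots & \ddots & \vdots \\ r_{ee}(2Q) & r_{ee}(2Q-1) & \cdots & r_{ee}(0) \end{bmatrix}^{-1} \begin{bmatrix} r_{ee}(T-Q) \\ r_{ee}(T-Q+1) \\ \vdots \\ r_{ee}(T+Q) \end{bmatrix} \tag{8.60}$$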
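
For the simplest case Q = 0 the pitch predictor has a single coefficient, and minimising the mean squared long-term prediction error gives p_0 = r_ee(T)/r_ee(0). The Python sketch below estimates the pitch period from the autocorrelation and then computes this coefficient. The function names, the search range [t_min, t_max] for the lag and the pulse-train test signal are assumptions made for illustration.

```python
def autocorrelation(e, lag):
    """Unnormalised autocorrelation: sum over m of e(m) e(m + lag)."""
    return sum(e[m] * e[m + lag] for m in range(len(e) - lag))


def single_tap_pitch_predictor(e, t_min, t_max):
    """Return (T, p0) for the long-term predictor with Q = 0.

    T is the lag in [t_min, t_max] at which the autocorrelation is largest
    (the first such lag if there are ties); p0 = r(T) / r(0).
    """
    T = max(range(t_min, t_max + 1), key=lambda lag: autocorrelation(e, lag))
    p0 = autocorrelation(e, T) / autocorrelation(e, 0)
    return T, p0


# Hypothetical excitation: a pulse train with a period of 10 samples.
e = [1.0 if m % 10 == 0 else 0.0 for m in range(100)]
T, p0 = single_tap_pitch_predictor(e, t_min=2, t_max=30)
print("pitch period:", T, "coefficient:", p0)
```

Restricting the search to lags above t_min keeps the short-lag peak around zero from being mistaken for the pitch peak.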



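An alternative to the separate, cascade, modelling of the short-term and
long-term correlations is to combine the short-term and long-term predictors
into a single model described as

$$x(m) = \underbrace{\sum_{k=1}^{P} a_k\, x(m-k)}_{\text{short-term prediction}} + \underbrace{\sum_{k=-Q}^{Q} p_k\, x(m-k-T)}_{\text{long-term prediction}} + \varepsilon(m) \tag{8.61}$$

In Equation (8.61), each sample is expressed as a linear combination of P
immediate past samples and 2Q+1 samples a pitch period away.
Minimisation of $E[\varepsilon^2(m)]$ results in the following solution for the
coefficients of the combined short-term and long-term predictor:

$$\begin{bmatrix} a_1 \\ \vdots \\ a_P \\ p_{-Q} \\ \vdots \\ p_{Q} \end{bmatrix} = \begin{bmatrix} r(0) & \cdots & r(P-1) & r(T-Q-1) & \cdots & r(T+Q-1) \\ \vdots & \ddots & \vdots & \vdots & & \vdots \\ r(P-1) & \cdots & r(0) & r(T-Q-P) & \cdots & r(T+Q-P) \\ r(T-Q-1) & \cdots & r(T-Q-P) & r(0) & \cdots & r(2Q) \\ \vdots & & \vdots & \vdots & \ddots & \vdots \\ r(T+Q-1) & \cdots & r(T+Q-P) & r(2Q) & \cdots & r(0) \end{bmatrix}^{-1} \begin{bmatrix} r(1) \\ \vdots \\ r(P) \\ r(T-Q) \\ \vdots \\ r(T+Q) \end{bmatrix} \tag{8.62}$$

In Equation (8.62), for simplicity the subscript xx of $r_{xx}(k)$ has been omitted.
In Chapter 10, the predictor model of Equation (8.61) is used for
interpolation of a sequence of missing samples.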



8.4 MAP Estimation of Predictor Coefficients 



The posterior probability density function of a predictor coefficient vector a,
given a signal x and the initial samples $x_I$, can be expressed, using Bayes'
rule, as

$$f_{A|X,X_I}(a \mid x, x_I) = \frac{f_{X|A,X_I}(x \mid a, x_I)\, f_{A|X_I}(a \mid x_I)}{f_{X|X_I}(x \mid x_I)} \tag{8.63}$$

In Equation (8.63), the pdfs are conditioned on P initial signal samples
$x_I = [x(-P), x(-P+1), \ldots, x(-1)]$. Note that for a given set of samples $[x, x_I]$,
$f_{X|X_I}(x \mid x_I)$ is a constant, and it is reasonable to assume that
$f_{A|X_I}(a \mid x_I) = f_A(a)$.

8.4.1 Probability Density Function of Predictor Output 

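The pdf $f_{X|A,X_I}(x \mid a, x_I)$ of the signal x, given the predictor coefficient vector a
and the initial samples $x_I$, is equal to the pdf of the input signal e:

$$f_{X|A,X_I}(x \mid a, x_I) = f_E(x - Xa) \tag{8.64}$$

where the input signal vector is given by

$$e = x - Xa \tag{8.65}$$

and $f_E(e)$ is the pdf of e. Equation (8.65) can be expanded as

$$\begin{bmatrix} e(0) \\ e(1) \\ e(2) \\ \vdots \\ e(N-1) \end{bmatrix} = \begin{bmatrix} x(0) \\ x(1) \\ x(2) \\ \vdots \\ x(N-1) \end{bmatrix} - \begin{bmatrix} x(-1) & x(-2) & \cdots & x(-P) \\ x(0) & x(-1) & \cdots & x(1-P) \\ x(1) & x(0) & \cdots & x(2-P) \\ \vdots & \vdots & \ddots & \vdots \\ x(N-2) & x(N-3) & \cdots & x(N-P-1) \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_P \end{bmatrix} \tag{8.66}$$

Assuming that the input excitation signal e(m) is a zero-mean, uncorrelated,
Gaussian process with a variance of $\sigma_e^2$, the likelihood function in Equation
(8.64) becomes

$$f_{X|A,X_I}(x \mid a, x_I) = f_E(x - Xa) = \frac{1}{(2\pi\sigma_e^2)^{N/2}} \exp\left( -\frac{1}{2\sigma_e^2} (x - Xa)^{T} (x - Xa) \right) \tag{8.67}$$

An alternative form of Equation (8.67) can be obtained by rewriting
Equation (8.66) in the following form:

$$\begin{bmatrix} e(0) \\ e(1) \\ \vdots \\ e(N-1) \end{bmatrix} = \begin{bmatrix} -a_P & \cdots & -a_1 & 1 & 0 & \cdots & 0 \\ 0 & -a_P & \cdots & -a_1 & 1 & \cdots & 0 \\ \vdots & & \ddots & & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & -a_P & \cdots & -a_1 & 1 \end{bmatrix} \begin{bmatrix} x(-P) \\ x(-P+1) \\ \vdots \\ x(N-1) \end{bmatrix} \tag{8.68}$$

In a compact notation Equation (8.68) can be written as

$$e = Ax \tag{8.69}$$

Using Equation (8.69), and assuming that the excitation signal e(m) is a zero-mean,
uncorrelated process with variance $\sigma_e^2$, the likelihood function of
Equation (8.67) can be written as

$$f_{X|A,X_I}(x \mid a, x_I) = \frac{1}{(2\pi\sigma_e^2)^{N/2}} \exp\left( -\frac{1}{2\sigma_e^2}\, x^{T} A^{T} A\, x \right) \tag{8.70}$$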

8.4.2 Using the Prior pdf of the Predictor Coefficients 

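The prior pdf of the predictor coefficient vector is assumed to have a
Gaussian distribution with a mean vector $\mu_a$ and a covariance matrix $\Sigma_{aa}$:

$$f_A(a) = \frac{1}{(2\pi)^{P/2} |\Sigma_{aa}|^{1/2}} \exp\left( -\frac{1}{2} (a - \mu_a)^{T} \Sigma_{aa}^{-1} (a - \mu_a) \right) \tag{8.71}$$

Substituting Equations (8.67) and (8.71) in Equation (8.63), the posterior
pdf of the predictor coefficient vector $f_{A|X,X_I}(a \mid x, x_I)$ can be expressed as

$$f_{A|X,X_I}(a \mid x, x_I) = \frac{1}{f_{X|X_I}(x \mid x_I)}\, \frac{1}{(2\pi)^{(N+P)/2}\, \sigma_e^{N}\, |\Sigma_{aa}|^{1/2}} \exp\left\{ -\frac{1}{2} \left[ \frac{1}{\sigma_e^2} (x - Xa)^{T} (x - Xa) + (a - \mu_a)^{T} \Sigma_{aa}^{-1} (a - \mu_a) \right] \right\} \tag{8.72}$$

The maximum a posteriori estimate is obtained by maximising the logarithm
of the posterior pdf:

$$\frac{\partial}{\partial a} \ln f_{A|X,X_I}(a \mid x, x_I) = -\frac{\partial}{\partial a} \left[ \frac{1}{2\sigma_e^2} (x - Xa)^{T} (x - Xa) + \frac{1}{2} (a - \mu_a)^{T} \Sigma_{aa}^{-1} (a - \mu_a) \right] = 0 \tag{8.73}$$

This yields

$$\hat{a}^{MAP} = \left( \Sigma_{aa} X^{T} X + \sigma_e^2 I \right)^{-1} \Sigma_{aa} X^{T} x + \sigma_e^2 \left( \Sigma_{aa} X^{T} X + \sigma_e^2 I \right)^{-1} \mu_a \tag{8.74}$$

Note that as the Gaussian prior tends to a uniform prior, the determinant
of the covariance matrix $\Sigma_{aa}$ of the Gaussian prior increases, and the MAP solution
tends to the least square error solution:

$$\hat{a}^{LS} = (X^{T} X)^{-1} (X^{T} x) \tag{8.75}$$

Similarly, as the observation length N increases, the signal matrix $X^{T} X$
grows and dominates the prior term $\sigma_e^2 \Sigma_{aa}^{-1}$, and again the MAP solution
tends to a least squared error solution.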
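
The two limiting behaviours can be seen most easily for a first-order predictor (P = 1), where Equation (8.74) reduces to a scalar expression. The Python sketch below is an illustration of that special case only; the function name `map_first_order`, the data and the prior parameters are assumptions.

```python
def map_first_order(x, x_init, mu_a, var_a, var_e):
    """MAP estimate of a in x(m) = a x(m-1) + e(m) with prior a ~ N(mu_a, var_a).

    x:      observed samples x(0), ..., x(N-1).
    x_init: the single initial sample x(-1).
    var_e:  variance of the excitation e(m).
    """
    past = [x_init] + list(x[:-1])                 # x(m-1) for m = 0..N-1
    sxx = sum(p * p for p in past)                 # X^T X (a scalar for P = 1)
    sxy = sum(p * c for p, c in zip(past, x))      # X^T x
    # Scalar form of (8.74):
    # (var_a * X^T x + var_e * mu_a) / (var_a * X^T X + var_e)
    return (var_a * sxy + var_e * mu_a) / (var_a * sxx + var_e)


x_init = 1.0
x = [0.5, 0.25, 0.125]          # noise-free data generated with a = 0.5

# Nearly uniform prior: the estimate approaches the least squares value 0.5.
a_flat = map_first_order(x, x_init, mu_a=0.0, var_a=1e12, var_e=1.0)
# Very confident prior centred on 0.9: the estimate stays near the prior mean.
a_tight = map_first_order(x, x_init, mu_a=0.9, var_a=1e-12, var_e=1.0)
print(a_flat, a_tight)
```

Intermediate values of the prior variance give an estimate between the prior mean and the least squares solution, weighted by the relative sizes of var_a * X^T X and var_e.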



8.5 Sub-Band Linear Prediction Model 

In a P th order linear prediction model, the P predictor coefficients model the 
signal spectrum over its full spectral bandwidth. The distribution of the LP 
parameters (or equivalently the poles of the LP model) over the signal 
bandwidth depends on the signal correlation and spectral structure. 
Generally, the parameters redistribute themselves over the spectrum to 
minimize the mean square prediction error criterion. An alternative to a 
conventional LP model is to divide the input signal into a number of sub- 
bands and to model the signal within each sub-band with a linear prediction 
model as shown in Figure 8.12. The advantages of using a sub-band LP 
model are as follows: 

(1) Sub-band linear prediction allows the designer to allocate a specific 
number of model parameters to a given sub-band. Different numbers 
of parameters can be allocated to different bands. 

(2) The solution of a full-band linear predictor equation, i.e. Equation
(8.10) or (8.16), requires the inversion of a relatively large
correlation matrix, whereas the solution of the sub-band LP models
requires the inversion of a number of relatively small correlation
matrices with better numerical stability properties. For example, a
predictor of order 18 requires the inversion of an 18x18 matrix,
whereas three sub-band predictors of order 6 require the inversion of
three 6x6 matrices.

(3) Sub-band linear prediction is useful for applications such as noise 
reduction where a sub-band approach can offer more flexibility and 
better performance. 
In sub-band linear prediction, the signal x(m) is passed through a bank of N
band-pass filters, and is split into N sub-band signals $x_k(m)$, k = 1, ..., N. The
k-th sub-band signal is modelled using a low-order linear prediction model as

$$x_k(m) = \sum_{i=1}^{P_k} a_k(i)\, x_k(m-i) + g_k\, e_k(m) \tag{8.76}$$

where $[a_k, g_k]$ are the coefficients and the gain of the predictor model for the
k-th sub-band. The choice of the model order $P_k$ depends on the width of the
sub-band and on the signal correlation structure within each sub-band. The
power spectrum of the input excitation of an ideal LP model for the k-th
sub-band signal can be expressed as

$$P_{EE}(f, k) = \begin{cases} 1 & f_{k,start} < f < f_{k,end} \\ 0 & \text{otherwise} \end{cases} \tag{8.77}$$

where $f_{k,start}$ and $f_{k,end}$ are the start and end frequencies of the k-th sub-band
signal. The autocorrelation function of the excitation function in each
sub-band is a sinc function given by

$$r_{ee}(m) = B_k\, \mathrm{sinc}\left[ m (B_k - f_{k0}) / 2 \right] \tag{8.78}$$

where $B_k$ and $f_{k0}$ are the bandwidth and the centre frequency of the k-th
sub-band respectively. To ensure that the LP parameters of each sub-band model
only the signal within that sub-band, the sub-band signals are down-sampled as
shown in Figure 8.12.

Figure 8.12 Configuration of a sub-band linear prediction model.



8.6 Signal Restoration Using Linear Prediction Models 



Linear prediction models are extensively used in speech and audio signal 
restoration. For a noisy signal, linear prediction analysis models the 
combined spectra of the signal and the noise processes. For example, the 
frequency spectrum of a linear prediction model of speech, observed in 
additive white noise, would be flatter than the spectrum of the noise-free 
speech, owing to the influence of the flat spectrum of white noise. In this 
section we consider the estimation of the coefficients of a predictor model 
from noisy observations, and the use of linear prediction models in signal 
restoration. The noisy signal y(m) is modelled as

$$y(m) = x(m) + n(m) = \sum_{k=1}^{P} a_k\, x(m-k) + e(m) + n(m) \tag{8.79}$$

where the signal x(m) is modelled by a linear prediction model with
coefficients $a_k$ and random input e(m), and it is assumed that the noise n(m)
is additive. The least square error predictor model of the noisy signal y(m) is
given by

$$R_{yy}\, \hat{a} = r_{yy} \tag{8.80}$$

where $R_{yy}$ and $r_{yy}$ are the autocorrelation matrix and vector of the noisy
signal y(m). For an additive noise model, with the noise uncorrelated with the
signal, Equation (8.80) can be written as

$$(R_{xx} + R_{nn})(a + \tilde{a}) = (r_{xx} + r_{nn}) \tag{8.81}$$

where $\tilde{a}$ is the error in the predictor coefficient vector due to the noise. A
simple method for removing the effects of noise is to subtract an estimate of
the autocorrelation of the noise from that of the noisy signal. The drawback
of this approach is that, owing to random variations of noise, correlation
subtraction can cause numerical instability in Equation (8.80) and result in
spurious solutions. In the following, we formulate the pdf of the noisy
signal and describe an iterative signal-restoration/parameter-estimation
procedure developed by Lim and Oppenheim.

From Bayes' rule, the posterior pdf of the predictor coefficient vector
a, given an observation signal vector y = [y(0), y(1), ..., y(N-1)] and the
initial samples vector $x_I$, from which the MAP estimate is obtained, is



$$f_{A|Y,X_I}(a \mid y, x_I) = \frac{f_{Y|A,X_I}(y \mid a, x_I)\, f_{A,X_I}(a, x_I)}{f_{Y,X_I}(y, x_I)} \tag{8.82}$$

Now consider the variance of the signal y in the argument of the term
$f_{Y|A,X_I}(y \mid a, x_I)$ in Equation (8.82). The innovation of y(m) can be defined
as

$$\varepsilon(m) = y(m) - \sum_{k=1}^{P} a_k\, y(m-k) = e(m) + n(m) - \sum_{k=1}^{P} a_k\, n(m-k) \tag{8.83}$$

The variance of y(m), given the previous P samples and the coefficient
vector a, is the variance of the innovation signal $\varepsilon(m)$, given by

$$\mathrm{Var}[\, y(m) \mid y(m-1), \ldots, y(m-P), a \,] = \sigma_\varepsilon^2 = \sigma_e^2 + \sigma_n^2 + \sigma_n^2 \sum_{k=1}^{P} a_k^2 \tag{8.84}$$

where $\sigma_e^2$ and $\sigma_n^2$ are the variances of the excitation signal and the noise
respectively, and the noise is assumed to be white and uncorrelated with the
excitation. From Equation (8.84), the variance of y(m) is a function of the
coefficient vector a. Consequently, maximisation of $f_{Y|A,X_I}(y \mid a, x_I)$ with
respect to the vector a is a non-linear and non-trivial exercise.

Lim and Oppenheim proposed the following iterative process, in which
an estimate $\hat{a}$ of the predictor coefficient vector is used to make an estimate
$\hat{x}$ of the signal vector, and the signal estimate $\hat{x}$ is then used to improve the
estimate of the parameter vector $\hat{a}$, and the process is iterated until
convergence. The posterior pdf of the noise-free signal x given the noisy
signal y and an estimate of the parameter vector $\hat{a}$ is given by

$$f_{X|A,Y}(x \mid \hat{a}, y) = \frac{f_{Y|A,X}(y \mid \hat{a}, x)\, f_{X|A}(x \mid \hat{a})}{f_{Y|A}(y \mid \hat{a})} \tag{8.85}$$

Consider the likelihood term $f_{Y|A,X}(y \mid \hat{a}, x)$. Since the noise is additive, and
assuming that it is a zero-mean white Gaussian process with variance $\sigma_n^2$, we
have

$$f_{Y|A,X}(y \mid \hat{a}, x) = f_N(y - x) = \frac{1}{(2\pi\sigma_n^2)^{N/2}} \exp\left( -\frac{1}{2\sigma_n^2} (y - x)^{T} (y - x) \right) \tag{8.86}$$



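Assuming that the input of the predictor model is a zero-mean Gaussian
process with variance $\sigma_e^2$, the pdf of the signal x given an estimate of the
predictor coefficient vector $\hat{a}$ is

$$f_{X|A}(x \mid \hat{a}) = \frac{1}{(2\pi\sigma_e^2)^{N/2}} \exp\left( -\frac{1}{2\sigma_e^2}\, e^{T} e \right) = \frac{1}{(2\pi\sigma_e^2)^{N/2}} \exp\left( -\frac{1}{2\sigma_e^2}\, x^{T} \hat{A}^{T} \hat{A}\, x \right) \tag{8.87}$$

where $e = \hat{A} x$ as in Equation (8.69). Substitution of Equations (8.86) and
(8.87) in Equation (8.85) yields

$$f_{X|A,Y}(x \mid \hat{a}, y) = \frac{1}{f_{Y|A}(y \mid \hat{a})}\, \frac{1}{(2\pi\sigma_n\sigma_e)^{N}} \exp\left[ -\frac{1}{2\sigma_n^2} (y - x)^{T} (y - x) - \frac{1}{2\sigma_e^2}\, x^{T} \hat{A}^{T} \hat{A}\, x \right] \tag{8.88}$$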



In Equation (8.88), for a given signal y and coefficient vector $\hat{a}$, $f_{Y|A}(y \mid \hat{a})$ is
a constant. From Equation (8.88), the MAP signal estimate is obtained by
maximising the logarithm of the posterior pdf as

$$\frac{\partial}{\partial x} \ln f_{X|A,Y}(x \mid \hat{a}, y) = -\frac{\partial}{\partial x} \left[ \frac{1}{2\sigma_e^2}\, x^{T} \hat{A}^{T} \hat{A}\, x + \frac{1}{2\sigma_n^2} (y - x)^{T} (y - x) \right] = 0 \tag{8.89}$$

which gives

$$\hat{x} = \sigma_e^2 \left( \sigma_n^2 \hat{A}^{T} \hat{A} + \sigma_e^2 I \right)^{-1} y \tag{8.90}$$

The signal estimate of Equation (8.90) can be used to obtain an updated
estimate of the predictor parameter. Assuming that the signal is a zero-mean
Gaussian process, the estimate of the predictor parameter vector a is given
by

$$\hat{a}(\hat{x}) = (\hat{X}^{T} \hat{X})^{-1} (\hat{X}^{T} \hat{x}) \tag{8.91}$$

Equations (8.90) and (8.91) form the basis for an iterative signal
restoration/parameter estimation method.



8.6.1 Frequency-Domain Signal Restoration Using Prediction 
Models 

The following algorithm is a frequency-domain implementation of the linear 
prediction model-based restoration of a signal observed in additive white 
noise. 

Initialisation: Set the initial signal estimate to the noisy signal, $\hat{x}_0 = y$.

For iterations i = 0, 1, ...

Step 1 Estimate the predictor parameter vector $\hat{a}_i$:

$$\hat{a}_i(\hat{x}_i) = (\hat{X}_i^{T} \hat{X}_i)^{-1} (\hat{X}_i^{T} \hat{x}_i) \tag{8.92}$$

Step 2 Calculate an estimate of the model gain G using Parseval's
theorem:

$$\frac{1}{N} \sum_{f=0}^{N-1} \frac{\hat{G}^2}{\left| 1 - \sum_{k=1}^{P} \hat{a}_{k,i}\, e^{-j2\pi fk/N} \right|^2} = \sum_{m=0}^{N-1} y^2(m) - N \hat{\sigma}_n^2 \tag{8.93}$$

where $\hat{a}_{k,i}$ are the coefficient estimates at iteration i, and $N \hat{\sigma}_n^2$ is the
energy of white noise over N samples.

Step 3 Calculate an estimate of the power spectrum of the speech model:

$$\hat{P}_{\hat{X}_i \hat{X}_i}(f) = \frac{\hat{G}^2}{\left| 1 - \sum_{k=1}^{P} \hat{a}_{k,i}\, e^{-j2\pi fk/N} \right|^2} \tag{8.94}$$

Step 4 Calculate the Wiener filter frequency response:

$$\hat{W}_i(f) = \frac{\hat{P}_{\hat{X}_i \hat{X}_i}(f)}{\hat{P}_{\hat{X}_i \hat{X}_i}(f) + \hat{P}_{N_i N_i}(f)} \tag{8.95}$$

where $\hat{P}_{N_i N_i}(f) = \hat{\sigma}_n^2$ is an estimate of the noise power spectrum.

Step 5 Filter the magnitude spectrum of the noisy speech as

$$\hat{X}_{i+1}(f) = \hat{W}_i(f)\, |Y(f)| \tag{8.96}$$

Restore the time domain signal $\hat{x}_{i+1}$ by combining $\hat{X}_{i+1}(f)$ with the
phase of the noisy signal and transforming the complex signal to the time domain.

Step 6 Go to Step 1 and repeat until convergence, or for a specified number
of iterations.
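
Steps 3 and 4 of the algorithm can be sketched in Python as follows. This is a partial illustration, not the complete iteration: the function names `lp_power_spectrum` and `wiener_gain` and the numerical values are assumptions, the predictor coefficients and gain are taken as given (Steps 1 and 2), and the noise is assumed white with known variance.

```python
import cmath
import math


def lp_power_spectrum(a, gain, n_freq):
    """LP model power spectrum of Equation (8.94) at f = 0, ..., N-1.

    a:    predictor coefficients a_1, ..., a_P.
    gain: model gain G.
    """
    spectrum = []
    for f in range(n_freq):
        # Inverse filter response 1 - sum_k a_k exp(-j 2 pi f k / N)
        inverse_filter = 1.0 - sum(
            ak * cmath.exp(-2j * math.pi * f * (k + 1) / n_freq)
            for k, ak in enumerate(a)
        )
        spectrum.append(gain ** 2 / abs(inverse_filter) ** 2)
    return spectrum


def wiener_gain(signal_spectrum, noise_variance):
    """Wiener filter of Equation (8.95) for white noise of the given variance."""
    return [p / (p + noise_variance) for p in signal_spectrum]


p_xx = lp_power_spectrum([0.5], gain=1.0, n_freq=4)
w = wiener_gain(p_xx, noise_variance=1.0)
print(w)
```

The gain is close to one where the model spectrum is well above the noise floor and falls towards zero where the noise dominates, so spectral peaks of the signal model are preserved while the valleys are attenuated.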

Figure 8.13 illustrates a block diagram configuration of a Wiener filter using 
a linear prediction estimate of the signal spectrum. Figure 8.14 illustrates the 
result of an iterative restoration of the spectrum of a noisy speech signal. 
[Block diagram: input y(m) = x(m) + n(m); blocks include speech activity and noise estimator; output is the restored signal estimate.]

Figure 8.13 Iterative signal restoration based on linear prediction model of speech.

Figure 8.14 Illustration of restoration of a noisy signal with iterative linear prediction
based method.



8.6.2 Implementation of Sub-Band Linear Prediction Wiener 
Filters 

Assuming that the noise is additive, the noisy signal in each sub-band is
modelled as

$$y_k(m) = x_k(m) + n_k(m) \tag{8.97}$$

The Wiener filter in the frequency domain can be expressed in terms of the
power spectra, or in terms of LP model frequency responses, of the signal
and noise process as

$$W_k(f) = \frac{P_{X,k}(f)}{P_{Y,k}(f)} = \frac{g_{X,k}^2}{\left| A_{X,k}(f) \right|^2}\, \frac{\left| A_{Y,k}(f) \right|^2}{g_{Y,k}^2} \tag{8.98}$$

where $P_{X,k}(f)$ and $P_{Y,k}(f)$ are the power spectra of the clean signal and the
noisy signal for the k-th sub-band respectively, and g and A(f) denote the gain
and the inverse filter frequency response of the corresponding LP model. From
Equation (8.98) the square-root Wiener filter is given by

$$W_k^{1/2}(f) = \frac{g_{X,k}}{\left| A_{X,k}(f) \right|}\, \frac{\left| A_{Y,k}(f) \right|}{g_{Y,k}} \tag{8.99}$$

The linear prediction Wiener filter of Equation (8.99) can be implemented in
the time domain with a cascade of a linear predictor of the clean signal,
followed by an inverse predictor filter of the noisy signal, as expressed by
the following relations (see Figure 8.15):

$$z_k(m) = \sum_{i=1}^{P} a_{Xk}(i)\, z_k(m-i) + \frac{g_X}{g_Y}\, y_k(m) \tag{8.100}$$

$$\hat{x}_k(m) = \sum_{i=0}^{P} a_{Yk}(i)\, z_k(m-i) \tag{8.101}$$

where $\hat{x}_k(m)$ is the restored estimate of the clean speech signal $x_k(m)$,
$z_k(m)$ is an intermediate signal, and $a_{Yk}(i)$, i = 0, ..., P, are the
coefficients of the inverse predictor filter of the noisy signal.
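
The cascade of Equations (8.100) and (8.101) can be sketched for one sub-band as follows. The function name `lp_wiener_cascade` and the test values are assumptions; zero initial conditions are assumed, and the list `inv_y` is taken to hold the inverse-filter coefficients indexed from i = 0, as the lower limit of the sum in Equation (8.101) suggests.

```python
def lp_wiener_cascade(y, a_x, inv_y, gain_ratio):
    """Time-domain LP Wiener filter for one sub-band.

    y:          noisy sub-band samples y(0), ..., y(N-1).
    a_x:        predictor coefficients a_X(1..P) of the clean-signal model.
    inv_y:      inverse-filter coefficients a_Y(0..P) of the noisy-signal model.
    gain_ratio: g_X / g_Y.
    """
    z = []
    x_hat = []
    for m in range(len(y)):
        # All-pole stage, Equation (8.100).
        zm = gain_ratio * y[m] + sum(
            a_x[i - 1] * z[m - i] for i in range(1, len(a_x) + 1) if m - i >= 0
        )
        z.append(zm)
        # All-zero (inverse filter) stage, Equation (8.101).
        x_hat.append(sum(
            inv_y[i] * z[m - i] for i in range(len(inv_y)) if m - i >= 0
        ))
    return x_hat


y = [1.0, -2.0, 0.5, 3.0]
# If the clean and noisy models coincide, the cascade reduces to a pure gain.
out = lp_wiener_cascade(y, a_x=[0.5], inv_y=[1.0, -0.5], gain_ratio=1.0)
print(out)
```

When the two models are identical and the gains are equal, the all-zero stage cancels the all-pole stage and the output equals the input, which is the expected behaviour of a Wiener filter in the absence of noise.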



Figure 8.15 A cascade implementation of the LP square-root Wiener filter.

8.7 Summary

Linear prediction models are used in a wide range of signal processing 
applications from low-bit-rate speech coding to model-based spectral 
analysis. We began this chapter with an introduction to linear prediction 
theory, and considered different methods of formulation of the prediction 
problem and derivations of the predictor coefficients. The main attraction of 
the linear prediction method is the closed-form solution of the predictor 
coefficients, and the availability of a number of efficient and relatively 
robust methods for solving the prediction equation such as the Levinson- 
Durbin method. In Section 8.2, we considered the forward, backward and 
lattice predictors. Although the direct-form implementation of the linear 
predictor is the most convenient method, for many applications, such as 
transmission of the predictor coefficients in speech coding, it is 
advantageous to use the lattice form of the predictor. This is because the 
lattice form can be conveniently checked for stability, and furthermore a 
perturbation of the parameter of any section of the lattice structure has a 
limited and more localised effect. In Section 8.3, we considered a modified 
form of linear prediction that models the short-term and long-term 
correlations of the signal. This method can be used for the modelling of 
signals with a quasi-periodic structure such as voiced speech. In Section 8.4, 
we considered MAP estimation and the use of a prior pdf for derivation of 
the predictor coefficients. In Section 8.5, the sub-band linear prediction 
method was formulated. Finally in Section 8.6, a linear prediction model 
was applied to the restoration of a signal observed in additive noise. 

POWER SPECTRUM AND CORRELATION



9.1 Power Spectrum and Correlation 

9.2 Fourier Series: Representation of Periodic Signals 

9.3 Fourier Transform: Representation of Aperiodic Signals 

9.4 Non-Parametric Power Spectral Estimation 

9.5 Model-Based Power Spectral Estimation 

9.6 High Resolution Spectral Estimation Based on Subspace Eigen-Analysis 

9.7 Summary 



The power spectrum reveals the existence, or the absence, of repetitive
patterns and correlation structures in a signal process. These 
structural patterns are important in a wide range of applications such 
as data forecasting, signal coding, signal detection, radar, pattern 
recognition, and decision-making systems. The most common method of 
spectral estimation is based on the fast Fourier transform (FFT). For many 
applications, FFT-based methods produce sufficiently good results. 
However, more advanced methods of spectral estimation can offer better 
frequency resolution, and less variance. This chapter begins with an 
introduction to the Fourier series and transform and the basic principles of 
spectral estimation. The classical methods for power spectrum estimation 
are based on periodograms. Various methods of averaging periodograms, 
and their effects on the variance of spectral estimates, are considered. We 
then study the maximum entropy and the model-based spectral estimation 
methods. We also consider several high-resolution spectral estimation 
methods, based on eigen-analysis, for the estimation of sinusoids observed 
in additive white noise. 

9.1 Power Spectrum and Correlation

The power spectrum of a signal gives the distribution of the signal power 
among various frequencies. The power spectrum is the Fourier transform of 
the correlation function, and reveals information on the correlation structure 
of the signal. The strength of the Fourier transform in signal analysis and 
pattern recognition is its ability to reveal spectral structures that may be used 
to characterise a signal. This is illustrated in Figure 9.1 for the two extreme
cases of a sine wave and a purely random signal. For a periodic signal, the 
power is concentrated in extremely narrow bands of frequencies, indicating 
the existence of structure and the predictable character of the signal. In the 
case of a pure sine wave as shown in Figure 9.1(a) the signal power is 
concentrated in one frequency. For a purely random signal as shown in 
Figure 9.1(b) the signal power is spread equally in the frequency domain, 
indicating the lack of structure in the signal. 

In general, the more correlated or predictable a signal, the more 
concentrated its power spectrum, and conversely the more random or 
unpredictable a signal, the more spread its power spectrum. Therefore the 
power spectrum of a signal can be used to deduce the existence of repetitive 
structures or correlated patterns in the signal process. Such information is 
crucial in detection, decision making and estimation problems, and in 
systems analysis. 






Figure 9.1 The concentration/spread of power in frequency indicates the 
correlated or random character of a signal: (a) a predictable signal, (b) a 
random signal. 

Figure 9.2 Fourier basis functions: (a) real and imaginary parts of a complex 
sinusoid, (b) vector representation of a complex exponential. 



9.2 Fourier Series: Representation of Periodic Signals 

The following three sinusoidal functions form the basis functions for the 
Fourier analysis: 



$$x_1(t) = \cos \omega_0 t \tag{9.1}$$

$$x_2(t) = \sin \omega_0 t \tag{9.2}$$

$$x_3(t) = \cos \omega_0 t + j \sin \omega_0 t = e^{j\omega_0 t} \tag{9.3}$$



Figure 9.2(a) shows the cosine and the sine components of the complex 
exponential (cisoidal) signal of Equation (9.3), and Figure 9.2(b) shows a 
vector representation of the complex exponential in a complex plane with 
real (Re) and imaginary (Im) dimensions. The Fourier basis functions are 
periodic with an angular frequency of $\omega_0$ (rad/s) and a period of 
$T_0 = 2\pi/\omega_0 = 1/F_0$, where $F_0$ is the frequency (Hz). The following properties 
make the sinusoids the ideal choice as the elementary building block basis 
functions for signal analysis and synthesis: 

(i) Orthogonality: two sinusoidal functions of different frequencies 
have the following orthogonal property: 
$$\int_{-\infty}^{\infty} \sin(\omega_1 t)\sin(\omega_2 t)\,dt = -\frac{1}{2}\int_{-\infty}^{\infty} \cos(\omega_1+\omega_2)t\,dt + \frac{1}{2}\int_{-\infty}^{\infty} \cos(\omega_1-\omega_2)t\,dt = 0 \tag{9.4}$$



For harmonically related sinusoids, the integration can be taken 
over one period. Similar equations can be derived for the product of 
cosines, or sine and cosine, of different frequencies. Orthogonality 
implies that the sinusoidal basis functions are independent and can 
be processed independently. For example, in a graphic equaliser, 
we can change the relative amplitudes of one set of frequencies, 
such as the bass, without affecting other frequencies, and in sub- 
band coding different frequency bands are coded independently and 
allocated different numbers of bits. 



(ii) Sinusoidal functions are infinitely differentiable. This is important, 
as most signal analysis, synthesis and manipulation methods 
require the signals to be differentiable. 

(iii) Sine and cosine signals of the same frequency have only a phase 
difference of $\pi/2$ or, equivalently, a relative time delay of a quarter 
of one period, i.e. $T_0/4$. 



Associated with the complex exponential function $e^{j\omega_0 t}$ is a set of 
harmonically related complex exponentials of the form 

$$\left[1,\ e^{\pm j\omega_0 t},\ e^{\pm j2\omega_0 t},\ e^{\pm j3\omega_0 t},\ \ldots\right] \tag{9.5}$$

The set of exponential signals in Equation (9.5) is periodic with a 
fundamental frequency $\omega_0 = 2\pi/T_0 = 2\pi F_0$, where $T_0$ is the period and $F_0$ is the 
fundamental frequency. These signals form the set of basis functions for the 
Fourier analysis. Any linear combination of these signals of the form 

$$\sum_{k=-\infty}^{\infty} c_k e^{jk\omega_0 t} \tag{9.6}$$

is also periodic with a period $T_0$. Conversely, any periodic signal $x(t)$ can be 
synthesised from a linear combination of harmonically related exponentials. 
The Fourier series representation of a periodic signal is given by the 
following synthesis and analysis equations: 
$$x(t) = \sum_{k=-\infty}^{\infty} c_k e^{jk\omega_0 t}, \qquad k = \ldots, -1, 0, 1, \ldots \quad \text{(synthesis equation)} \tag{9.7}$$

$$c_k = \frac{1}{T_0}\int_{-T_0/2}^{T_0/2} x(t) e^{-jk\omega_0 t}\,dt, \qquad k = \ldots, -1, 0, 1, \ldots \quad \text{(analysis equation)} \tag{9.8}$$

The complex-valued coefficient $c_k$ conveys the amplitude (a measure of the 
strength) and the phase of the frequency content of the signal at the angular 
frequency $k\omega_0$ (rad/s). Note from Equation (9.8) that the coefficient $c_k$ may 
be interpreted as a measure of the correlation of the signal $x(t)$ and the 
complex exponential $e^{-jk\omega_0 t}$. 
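
As an illustration of the analysis and synthesis equations, the following sketch approximates the coefficients $c_k$ of Equation (9.8) by a midpoint Riemann sum and rebuilds the signal from a truncated version of Equation (9.7). The square-wave test signal, the function names and the number of integration steps are assumptions made for the example, not part of the text.

```python
import cmath
import math

def fourier_coefficient(x, T0, k, n_steps=4096):
    """Approximate c_k of Equation (9.8) by a midpoint sum over one period."""
    w0 = 2.0 * math.pi / T0
    dt = T0 / n_steps
    total = 0j
    for i in range(n_steps):
        t = -T0 / 2.0 + (i + 0.5) * dt
        total += x(t) * cmath.exp(-1j * k * w0 * t) * dt
    return total / T0

def synthesise(coeffs, T0, t):
    """Truncated form of Equation (9.7); coeffs maps the harmonic index k to c_k."""
    w0 = 2.0 * math.pi / T0
    return sum(c * cmath.exp(1j * k * w0 * t) for k, c in coeffs.items())

def square(t):
    """Hypothetical test signal: unit square wave of period 1, odd about t = 0."""
    return 1.0 if (t % 1.0) < 0.5 else -1.0

T0 = 1.0
coeffs = {k: fourier_coefficient(square, T0, k) for k in range(-25, 26)}
print(coeffs[1], coeffs[2], synthesise(coeffs, T0, 0.25).real)
```

For this odd square wave the odd harmonics carry all the power (analytically $c_k = -2j/(\pi k)$ for odd $k$), and the truncated synthesis approaches the value 1 in the middle of the positive half-period.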

9.3 Fourier Transform: Representation of Aperiodic Signals 

The Fourier series representation of periodic signals consists of harmonically 
related spectral lines spaced at integer multiples of the fundamental 
frequency. The Fourier representation of aperiodic signals can be developed 
by regarding an aperiodic signal as a special case of a periodic signal with 
an infinite period. If the period of a signal is infinite then the signal does not 
repeat itself, and is aperiodic. 

Now consider the discrete spectra of a periodic signal with a period of 
$T_0$, as shown in Figure 9.3(a). As the period $T_0$ is increased, the fundamental 
frequency $F_0 = 1/T_0$ decreases, and successive spectral lines become more 
closely spaced. In the limit as the period tends to infinity (i.e. as the signal 
becomes aperiodic), the discrete spectral lines merge and form a continuous 
spectrum. Therefore the Fourier equations for an aperiodic signal (known as 
the Fourier transform) must reflect the fact that the frequency spectrum of an 
aperiodic signal is continuous. Hence, to obtain the Fourier transform 
relation, the discrete-frequency variables and operations in the Fourier series 
Equations (9.7) and (9.8) should be replaced by their continuous-frequency 
counterparts. That is, the discrete summation sign $\sum$ should be replaced by 
the continuous summation integral $\int$, the discrete harmonics of the 
fundamental frequency $kF_0$ should be replaced by the continuous frequency 
variable $f$, and the discrete frequency spectrum $c_k$ should be replaced by a 
continuous frequency spectrum, say $X(f)$. 

Figure 9.3 (a) A periodic pulse train and its line spectrum, (b) A single pulse from 
the periodic train in (a) with an imagined “off” duration of infinity; its spectrum is 
the envelope of the spectrum of the periodic signal in (a). 



The Fourier synthesis and analysis equations for aperiodic signals, the 
so-called Fourier transform pair, are given by 

$$x(t) = \int_{-\infty}^{\infty} X(f) e^{j2\pi ft}\,df \tag{9.9}$$

$$X(f) = \int_{-\infty}^{\infty} x(t) e^{-j2\pi ft}\,dt \tag{9.10}$$

Note from Equation (9.10) that $X(f)$ may be interpreted as a measure of 
the correlation of the signal $x(t)$ and the complex sinusoid $e^{-j2\pi ft}$. 

The condition for existence and computability of the Fourier transform 
integral of a signal x(t) is that the signal must have finite energy: 



$$\int_{-\infty}^{\infty} |x(t)|^2\,dt < \infty \tag{9.11}$$

Figure 9.4 Illustration of the DFT as a parallel-input, parallel-output processor. 



9.3.1 Discrete Fourier Transform (DFT) 

For a finite-duration, discrete-time signal x(m) of length N samples, the 
discrete Fourier transform (DFT) is defined as N uniformly spaced spectral 
samples 



$$X(k) = \sum_{m=0}^{N-1} x(m) e^{-j(2\pi/N)mk}, \qquad k = 0, \ldots, N-1 \tag{9.12}$$

(see Figure 9.4). The inverse discrete Fourier transform (IDFT) is given by 

$$x(m) = \frac{1}{N}\sum_{k=0}^{N-1} X(k) e^{j(2\pi/N)mk}, \qquad m = 0, \ldots, N-1 \tag{9.13}$$

From Equation (9.12), the direct calculation of the Fourier transform 
requires $N(N-1)$ multiplications and a similar number of additions. 
Algorithms that reduce the computational complexity of the discrete Fourier 
transform are known as fast Fourier transform (FFT) methods. FFT 
methods utilise the periodic and symmetric properties of $e^{-j2\pi/N}$ to avoid 
redundant calculations. 
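
The following sketch evaluates Equations (9.12) and (9.13) directly and compares the result with a textbook radix-2 decimation-in-time FFT. It is only an illustration in plain Python: the function names and the test signal are assumptions, and in practice a library routine such as `numpy.fft.fft` would normally be used.

```python
import cmath
import math

def dft(x):
    """Direct evaluation of Equation (9.12): about N*N complex products."""
    N = len(x)
    return [sum(x[m] * cmath.exp(-2j * math.pi * m * k / N) for m in range(N))
            for k in range(N)]

def idft(X):
    """Direct evaluation of Equation (9.13)."""
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * m * k / N) for k in range(N)) / N
            for m in range(N)]

def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    N = len(x)
    if N == 1:
        return [complex(x[0])]
    even = fft(x[0::2])            # DFT of the even-indexed samples
    odd = fft(x[1::2])             # DFT of the odd-indexed samples
    out = [0j] * N
    for k in range(N // 2):
        t = cmath.exp(-2j * math.pi * k / N) * odd[k]
        out[k] = even[k] + t                 # periodicity of the half-length DFTs
        out[k + N // 2] = even[k] - t        # symmetry of the twiddle factor
    return out

# Hypothetical test signal: a cosine with exactly 3 cycles in N = 16 samples.
N = 16
x = [math.cos(2 * math.pi * 3 * m / N) for m in range(N)]
X = dft(x)
print([round(abs(v), 6) for v in X])
```

The cosine occupies only the bins $k=3$ and $k=N-3$, each with magnitude $N/2$, and the FFT reproduces the direct DFT to within rounding error.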

9.3.2 Time/Frequency Resolutions, The Uncertainty Principle 

Signals such as speech, music or image are composed of non-stationary (i.e. 
time-varying and/or space-varying) events. For example, speech is 
composed of a string of short-duration sounds called phonemes, and an 
image is composed of various objects. When using the DFT, it is desirable 
to have high enough time and space resolution in order to obtain the spectral 
characteristics of each individual elementary event or object in the input 
signal. However, there is a fundamental trade-off between the length, i.e. the 
time or space resolution, of the input signal and the frequency resolution of 
the output spectrum. The DFT takes as the input a window of $N$ uniformly 
spaced time-domain samples $[x(0), x(1), \ldots, x(N-1)]$ of duration $\Delta T = N T_s$, 
and outputs $N$ spectral samples $[X(0), X(1), \ldots, X(N-1)]$ spaced uniformly 
between zero Hz and the sampling frequency $F_s = 1/T_s$ Hz. Hence the 
frequency resolution of the DFT spectrum $\Delta f$, i.e. the space between 
successive frequency samples, is given by 

$$\Delta f = \frac{1}{\Delta T} = \frac{1}{N T_s} = \frac{F_s}{N} \tag{9.14}$$

Note that the frequency resolution $\Delta f$ and the time resolution $\Delta T$ are 
inversely proportional, in that they cannot both be made small simultaneously; 
in fact, $\Delta T \Delta f = 1$. This is known as the uncertainty principle. 
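
A short numerical illustration of Equation (9.14) follows. The sampling rate of 8 kHz and the window lengths are assumed values chosen only to show the trade-off.

```python
def dft_resolution(n_samples, fs):
    """Return (delta_T, delta_f) of Equation (9.14) for an N-sample window."""
    ts = 1.0 / fs
    delta_t = n_samples * ts       # window duration in seconds
    delta_f = fs / n_samples       # spacing of the DFT bins in Hz
    return delta_t, delta_f

for n in (80, 160, 320, 640):
    dt, df = dft_resolution(n, fs=8000.0)
    print(f"N={n:4d}  window={dt * 1e3:5.1f} ms  bin spacing={df:6.2f} Hz")
```

Doubling the window length halves the bin spacing, and the product of the two quantities stays at one.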

9.3.3 Energy-Spectral Density and Power-Spectral Density 

Energy, or power, spectrum analysis is concerned with the distribution of 
the signal energy or power in the frequency domain. For a deterministic 
discrete-time signal, the energy-spectral density is defined as 

$$|X(f)|^2 = \left|\sum_{m=-\infty}^{\infty} x(m) e^{-j2\pi fm}\right|^2 \tag{9.15}$$

The energy spectrum of $x(m)$ may be expressed as the Fourier transform of 
the autocorrelation function of $x(m)$: 

$$|X(f)|^2 = X(f)X^*(f) = \sum_{m=-\infty}^{\infty} r_{xx}(m) e^{-j2\pi fm} \tag{9.16}$$

where the variable $r_{xx}(m)$ is the autocorrelation function of $x(m)$. The 
Fourier transform exists only for finite-energy signals. An important 
theoretical class of signals is that of stationary stochastic signals, which, as a 
consequence of the stationarity condition, are infinitely long and have 
infinite energy, and therefore do not possess a Fourier transform. For 
stochastic signals, the quantity of interest is the power-spectral density, 
defined as the Fourier transform of the autocorrelation function: 



$$P_{XX}(f) = \sum_{m=-\infty}^{\infty} r_{xx}(m) e^{-j2\pi fm} \tag{9.17}$$

where the autocorrelation function $r_{xx}(m)$ is defined as 

$$r_{xx}(m) = \mathcal{E}[x(k)x(k+m)] \tag{9.18}$$

In practice, the autocorrelation function is estimated from a signal record of 
length $N$ samples as 

$$\hat{r}_{xx}(m) = \frac{1}{N-|m|}\sum_{k=0}^{N-|m|-1} x(k)x(k+m), \qquad m = 0, \ldots, N-1 \tag{9.19}$$



In Equation (9.19), as the correlation lag $m$ approaches the record length $N$, 
the estimate of $r_{xx}(m)$ is obtained from the average of fewer samples and 
has a higher variance. A triangular window may be used to "down-weight" 
the correlation estimates for larger values of lag $m$. The triangular window 
has the form 

$$w(m) = \begin{cases} 1-\dfrac{|m|}{N}, & |m| \le N-1 \\ 0, & \text{otherwise} \end{cases} \tag{9.20}$$



Multiplication of Equation (9.19) by the window of Equation (9.20) yields 

$$\hat{r}_{xx}(m) = \frac{1}{N}\sum_{k=0}^{N-|m|-1} x(k)x(k+m) \tag{9.21}$$

The expectation of the windowed correlation estimate $\hat{r}_{xx}(m)$ is given by 

$$\mathcal{E}[\hat{r}_{xx}(m)] = \frac{1}{N}\sum_{k=0}^{N-|m|-1} \mathcal{E}[x(k)x(k+m)] = \left(1-\frac{|m|}{N}\right) r_{xx}(m) \tag{9.22}$$



In Jenkins and Watts, it is shown that the variance of $\hat{r}_{xx}(m)$ is given by 

$$\mathrm{Var}[\hat{r}_{xx}(m)] \approx \frac{1}{N}\sum_{k=-\infty}^{\infty} \left[r_{xx}^2(k) + r_{xx}(k-m)\,r_{xx}(k+m)\right] \tag{9.23}$$

From Equations (9.22) and (9.23), $\hat{r}_{xx}(m)$ is an asymptotically unbiased and 
consistent estimate. 
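
The two correlation estimates of Equations (9.19) and (9.21) differ only in their normalisation, as the following sketch shows. The function names and the sinusoidal test record are assumptions made for the example.

```python
import math

def autocorr_unbiased(x, m):
    """Equation (9.19): average of the N-|m| available lagged products."""
    N, m = len(x), abs(m)
    return sum(x[k] * x[k + m] for k in range(N - m)) / (N - m)

def autocorr_biased(x, m):
    """Equation (9.21): the same sum divided by N, i.e. (9.19) times the
    triangular window of Equation (9.20)."""
    N, m = len(x), abs(m)
    return sum(x[k] * x[k + m] for k in range(N - m)) / N

# Hypothetical test record: a sinusoid with 10 samples per cycle.
N = 200
x = [math.sin(2 * math.pi * 0.1 * k) for k in range(N)]
for m in (0, 10, 100, 190):
    print(m, round(autocorr_unbiased(x, m), 4), round(autocorr_biased(x, m), 4))
```

At a lag of one full cycle ($m=10$) the unbiased estimate stays at the signal power of 0.5, whereas the windowed estimate is scaled by $1-|m|/N$, in agreement with Equation (9.22).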



9.4 Non-Parametric Power Spectrum Estimation 

The classic method for estimation of the power spectral density of an N- 
sample record is the periodogram introduced by Sir Arthur Schuster in 1899. 
The periodogram is defined as 

$$\hat{P}_{XX}(f) = \frac{1}{N}\left|\sum_{m=0}^{N-1} x(m) e^{-j2\pi fm}\right|^2 = \frac{1}{N}|X(f)|^2 \tag{9.24}$$

The power-spectral density function, or power spectrum for short, defined in 
Equation (9.24), is the basis of non-parametric methods of spectral 
estimation. Owing to the finite length and the random nature of most 
signals, the spectra obtained from different records of a signal vary 
randomly about an average spectrum. A number of methods have been 
developed to reduce the variance of the periodogram. 

9.4.1 The Mean and Variance of Periodograms 

The mean of the periodogram is obtained by taking the expectation of 
Equation (9.24): 

$$\begin{aligned}
\mathcal{E}[\hat{P}_{XX}(f)] &= \frac{1}{N}\,\mathcal{E}\left[\sum_{m=0}^{N-1} x(m) e^{-j2\pi fm} \sum_{n=0}^{N-1} x(n) e^{j2\pi fn}\right] \\
&= \sum_{m=-(N-1)}^{N-1} \left(1-\frac{|m|}{N}\right) r_{xx}(m) e^{-j2\pi fm}
\end{aligned} \tag{9.25}$$


As the number of signal samples $N$ increases, we have 

$$\lim_{N\to\infty} \mathcal{E}[\hat{P}_{XX}(f)] = \sum_{m=-\infty}^{\infty} r_{xx}(m) e^{-j2\pi fm} = P_{XX}(f) \tag{9.26}$$

For a Gaussian random sequence, the variance of the periodogram can be 
obtained as 

$$\mathrm{Var}[\hat{P}_{XX}(f)] = P_{XX}^2(f)\left[1+\left(\frac{\sin 2\pi fN}{N \sin 2\pi f}\right)^2\right] \tag{9.27}$$

As the length of a signal record $N$ increases, the expectation of the 
periodogram converges to the power spectrum $P_{XX}(f)$ and the variance of 
$\hat{P}_{XX}(f)$ converges to $P_{XX}^2(f)$. Hence the periodogram is an asymptotically 
unbiased but not a consistent estimate. The periodograms can be calculated 
from a DFT of the signal $x(m)$, or from a DFT of the autocorrelation 
estimates $\hat{r}_{xx}(m)$. In addition, the signal from which the periodogram, or the 
autocorrelation samples, are obtained can be segmented into overlapping 
blocks to result in a larger number of periodograms, which can then be 
averaged. These methods and their effects on the variance of periodograms 
are considered in the following. 

9.4.2 Averaging Periodograms (Bartlett Method) 

In this method, several periodograms, from different segments of a signal, 
are averaged in order to reduce the variance of the periodogram. The Bartlett 
periodogram is obtained as the average of $K$ periodograms as 
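
$$\hat{P}_{XX}^{B}(f) = \frac{1}{K}\sum_{i=1}^{K} \hat{P}_{XX}^{(i)}(f) \tag{9.28}$$

where $\hat{P}_{XX}^{(i)}(f)$ is the periodogram of the $i$th segment of the signal. The 
expectation of the Bartlett periodogram $\hat{P}_{XX}^{B}(f)$ is given by 

$$\begin{aligned}
\mathcal{E}[\hat{P}_{XX}^{B}(f)] &= \mathcal{E}[\hat{P}_{XX}^{(i)}(f)] \\
&= \sum_{m=-(N-1)}^{N-1} \left(1-\frac{|m|}{N}\right) r_{xx}(m) e^{-j2\pi fm} \\
&= \frac{1}{N}\int_{-1/2}^{1/2} P_{XX}(\nu) \left[\frac{\sin \pi(f-\nu)N}{\sin \pi(f-\nu)}\right]^2 d\nu
\end{aligned} \tag{9.29}$$

where $(\sin \pi fN / \sin \pi f)^2/N$ is the frequency response of the triangular 
window $1-|m|/N$. From Equation (9.29), the Bartlett periodogram is 
asymptotically unbiased. The variance of $\hat{P}_{XX}^{B}(f)$ is $1/K$ of the variance of 
the periodogram, and is given by 

$$\mathrm{Var}[\hat{P}_{XX}^{B}(f)] = \frac{1}{K}\,P_{XX}^2(f)\left[1+\left(\frac{\sin 2\pi fN}{N \sin 2\pi f}\right)^2\right] \tag{9.30}$$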
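
The variance reduction of Equation (9.30) can be observed numerically. The sketch below, written in plain Python under assumed parameter values (a seeded unit-variance white Gaussian record of 1024 samples, segments of $N=64$ samples, hence $K=16$), compares the fluctuation of a single periodogram with that of the Bartlett average. For such a record the true power spectrum is flat and equal to one.

```python
import cmath
import math
import random

def periodogram(x):
    """Equation (9.24) sampled at the DFT frequencies f = k/N, k = 0..N-1."""
    N = len(x)
    out = []
    for k in range(N):
        X = sum(x[m] * cmath.exp(-2j * math.pi * k * m / N) for m in range(N))
        out.append(abs(X) ** 2 / N)
    return out

def bartlett(x, N):
    """Equation (9.28): average of the periodograms of K = len(x)//N
    non-overlapping segments of length N."""
    K = len(x) // N
    acc = [0.0] * N
    for i in range(K):
        for k, p in enumerate(periodogram(x[i * N:(i + 1) * N])):
            acc[k] += p / K
    return acc

def spread(values):
    """Sample variance, used as a crude measure of estimator fluctuation."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

random.seed(0)                                      # arbitrary seed for repeatability
x = [random.gauss(0.0, 1.0) for _ in range(1024)]   # white noise, P_XX(f) = 1
single = periodogram(x[:64])[1:32]    # one segment, bins strictly between 0 and 1/2
averaged = bartlett(x, 64)[1:32]      # average over K = 16 segments
print(spread(single), spread(averaged))
```

The averaged estimate should scatter about the true level of one with a variance of roughly $1/K$ of that of a single periodogram, at the cost of a frequency resolution set by the shorter segment length.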



9.4.3 Welch Method: Averaging Periodograms from Overlapped 
and Windowed Segments 

In this method, a signal $x(m)$, of length $M$ samples, is divided into $K$ 
overlapping segments of length $N$, and each segment is windowed prior to 
computing the periodogram. The $i$th segment is defined as 

$$x_i(m) = x(m+iD), \qquad m = 0, \ldots, N-1, \quad i = 0, \ldots, K-1 \tag{9.31}$$

where $D$ is the shift between the starting points of successive segments, so 
that successive segments overlap by $N-D$ samples. For half-overlap $D=N/2$, 
while $D=N$ corresponds to no overlap. For the $i$th windowed segment, the 
periodogram is given by 

$$\hat{P}_{XX}^{(i)}(f) = \frac{1}{NU}\left|\sum_{m=0}^{N-1} w(m)\,x_i(m)\,e^{-j2\pi fm}\right|^2 \tag{9.32}$$

where $w(m)$ is the window function and $U$ is the power in the window 
function, given by 

$$U = \frac{1}{N}\sum_{m=0}^{N-1} w^2(m) \tag{9.33}$$




The spectrum of a finite-length signal typically exhibits side-lobes due to 
discontinuities at the endpoints. The window function $w(m)$ alleviates the 
discontinuities and reduces the spread of the spectral energy into the 
side-lobes of the spectrum. The Welch power spectrum is the average of $K$ 
periodograms obtained from overlapped and windowed segments of a 
signal: 

$$\hat{P}_{XX}^{W}(f) = \frac{1}{K}\sum_{i=0}^{K-1} \hat{P}_{XX}^{(i)}(f) \tag{9.34}$$



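Using Equations (9.32) and (9.34), the expectation of $\hat{P}_{XX}^{W}(f)$ can be 
obtained as 

$$\begin{aligned}
\mathcal{E}[\hat{P}_{XX}^{W}(f)] &= \mathcal{E}[\hat{P}_{XX}^{(i)}(f)] \\
&= \frac{1}{NU}\sum_{n=0}^{N-1}\sum_{m=0}^{N-1} w(n)\,w(m)\,\mathcal{E}[x_i(m)x_i(n)]\,e^{-j2\pi f(n-m)} \\
&= \frac{1}{NU}\sum_{n=0}^{N-1}\sum_{m=0}^{N-1} w(n)\,w(m)\,r_{xx}(n-m)\,e^{-j2\pi f(n-m)} \\
&= \int_{-1/2}^{1/2} P_{XX}(\nu)\,W(\nu-f)\,d\nu
\end{aligned} \tag{9.35}$$

where 

$$W(f) = \frac{1}{NU}\left|\sum_{m=0}^{N-1} w(m) e^{-j2\pi fm}\right|^2 \tag{9.36}$$

and the variance of the Welch estimate is given by 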

and the variance of the Welch estimate is given by 
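
$$\mathrm{Var}[\hat{P}_{XX}^{W}(f)] = \frac{1}{K^2}\sum_{i=0}^{K-1}\sum_{j=0}^{K-1} \mathcal{E}\left[\hat{P}_{XX}^{(i)}(f)\,\hat{P}_{XX}^{(j)}(f)\right] - \left(\mathcal{E}[\hat{P}_{XX}^{W}(f)]\right)^2 \tag{9.37}$$

Welch has shown that for the case when there is no overlap, $D=N$, 

$$\mathrm{Var}[\hat{P}_{XX}^{W}(f)] = \frac{\mathrm{Var}[\hat{P}_{XX}^{(i)}(f)]}{K} \approx \frac{P_{XX}^2(f)}{K} \tag{9.38}$$

and for half-overlap, $D=N/2$, 

$$\mathrm{Var}[\hat{P}_{XX}^{W}(f)] = \frac{9}{8K}\,P_{XX}^2(f) \tag{9.39}$$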
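
The steps of the Welch method, Equations (9.31) to (9.34), can be sketched as follows. The Hann window is an assumed choice (the text does not prescribe a particular window), and the function names and the sinusoidal test signal are likewise illustrative.

```python
import cmath
import math

def hann(N):
    """Periodic Hann window, one possible choice for w(m)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * m / N) for m in range(N)]

def welch(x, N, D, window):
    """Welch estimate of Equation (9.34) at the DFT frequencies f = k/N.

    N is the segment length and D the shift between segment starts
    (D = N: no overlap, D = N // 2: half-overlap)."""
    U = sum(w * w for w in window) / N           # window power, Equation (9.33)
    K = (len(x) - N) // D + 1                    # number of complete segments
    psd = [0.0] * N
    for i in range(K):
        seg = [window[m] * x[m + i * D] for m in range(N)]   # windowed x_i(m), Eq. (9.31)
        for k in range(N):
            X = sum(seg[m] * cmath.exp(-2j * math.pi * k * m / N) for m in range(N))
            psd[k] += abs(X) ** 2 / (N * U * K)              # Eqs. (9.32) and (9.34)
    return psd

# Hypothetical test signal: unit-amplitude sinusoid with 8 cycles per 64-sample segment.
x = [math.sin(2.0 * math.pi * 8 * m / 64) for m in range(512)]
psd_rect = welch(x, 64, 32, [1.0] * 64)
psd_hann = welch(x, 64, 32, hann(64))
print(psd_rect[8], psd_hann[7:10])
```

With the rectangular window the whole power of this particular sinusoid falls in bin 8. The Hann window spreads the peak over the neighbouring bins, but the normalisation by $U$ keeps the total estimated power equal to the signal power of 0.5.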



9.4.4 Blackman-Tukey Method 

In this method, an estimate of a signal power spectrum is obtained from the 
Fourier transform of the windowed estimate of the autocorrelation function 
as 



$$\hat{P}_{XX}^{BT}(f) = \sum_{m=-(N-1)}^{N-1} w(m)\,\hat{r}_{xx}(m)\,e^{-j2\pi fm} \tag{9.40}$$

For a signal of $N$ samples, the number of samples available for estimation of 
the autocorrelation value at the lag $m$, $\hat{r}_{xx}(m)$, decreases as $m$ approaches $N$. 
Therefore, for large $m$, the variance of the autocorrelation estimate 
increases, and the estimate becomes less reliable. The window $w(m)$ has the 
effect of down-weighting the high variance coefficients at and around the 
end-points. The mean of the Blackman-Tukey power spectrum estimate is 

$$\mathcal{E}[\hat{P}_{XX}^{BT}(f)] = \sum_{m=-(N-1)}^{N-1} \mathcal{E}[\hat{r}_{xx}(m)]\,w(m)\,e^{-j2\pi fm} \tag{9.41}$$

Now $\mathcal{E}[\hat{r}_{xx}(m)] = r_{xx}(m)\,w_B(m)$, where $w_B(m)$ is the Bartlett, or triangular, 
window. Equation (9.41) may be written as 

$$\mathcal{E}[\hat{P}_{XX}^{BT}(f)] = \sum_{m=-(N-1)}^{N-1} r_{xx}(m)\,w_c(m)\,e^{-j2\pi fm} \tag{9.42}$$

where $w_c(m) = w_B(m)\,w(m)$. The right-hand side of Equation (9.42) can be 
written in terms of the Fourier transform of the autocorrelation and the 
window functions as 

$$\mathcal{E}[\hat{P}_{XX}^{BT}(f)] = \int_{-1/2}^{1/2} P_{XX}(\nu)\,W_c(f-\nu)\,d\nu \tag{9.43}$$

where $W_c(f)$ is the Fourier transform of $w_c(m)$. The variance of the 
Blackman-Tukey estimate is given by 

$$\mathrm{Var}[\hat{P}_{XX}^{BT}(f)] \approx \frac{U}{N}\,P_{XX}^2(f) \tag{9.44}$$

where $U$ is the energy of the window $w_c(m)$. 
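
A minimal sketch of the Blackman-Tukey estimate of Equation (9.40) follows. It assumes the biased correlation estimate of Equation (9.21) and a triangular lag window truncated to $|m| \le L$ with $L < N$; the lag limit $L$, the frequency grid and the test signal are illustrative choices rather than values taken from the text.

```python
import math

def autocorr_biased(x, m):
    """Equation (9.21): lagged products summed and divided by the record length N."""
    N, m = len(x), abs(m)
    return sum(x[k] * x[k + m] for k in range(N - m)) / N

def blackman_tukey(x, L, freqs):
    """Equation (9.40) with a triangular lag window limited to |m| <= L < N."""
    r = [autocorr_biased(x, m) for m in range(L + 1)]
    w = [1.0 - m / (L + 1.0) for m in range(L + 1)]
    # r(m) and w(m) are even in m, so the two-sided sum reduces to a cosine series.
    return [r[0] * w[0] + 2.0 * sum(w[m] * r[m] * math.cos(2.0 * math.pi * f * m)
                                    for m in range(1, L + 1))
            for f in freqs]

# Hypothetical test record: a sinusoid at 0.1 cycles per sample.
N = 200
x = [math.sin(2.0 * math.pi * 0.1 * k) for k in range(N)]
freqs = [i / 500.0 for i in range(251)]      # 0 to 0.5 cycles per sample
spectrum = blackman_tukey(x, 40, freqs)
f_peak = freqs[max(range(len(freqs)), key=lambda i: spectrum[i])]
print(f_peak)
```

A shorter lag window (smaller $L$) discards the less reliable high-lag correlation values and so lowers the variance, but it also widens the spectral peak.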

9.4.5 Power Spectrum Estimation from Autocorrelation of 
Overlapped Segments 

In the Blackman-Tukey method, in calculating a correlation sequence of 
length $N$ from a signal record of length $N$, progressively fewer samples are 
admitted in estimation of $\hat{r}_{xx}(m)$ as the lag $m$ approaches the signal length 
$N$. Hence the variance of $\hat{r}_{xx}(m)$ increases with the lag $m$. This problem can 
be solved by using a signal of length $2N$ samples for calculation of $N$ 
correlation values. In a generalisation of this method, the signal record $x(m)$, 
of length $M$ samples, is divided into a number $K$ of overlapping segments of 
length $2N$. The $i$th segment is defined as 

$$x_i(m) = x(m+iD), \qquad m = 0, 1, \ldots, 2N-1, \quad i = 0, 1, \ldots, K-1 \tag{9.45}$$

where $D$ is the shift between the starting points of successive segments. For 
each segment of length $2N$, the correlation function in the range 
$0 \le m \le N-1$ is given by 

$$\hat{r}_{xx}(m) = \frac{1}{N}\sum_{k=0}^{N-1} x_i(k)\,x_i(k+m), \qquad m = 0, 1, \ldots, N-1 \tag{9.46}$$

In Equation (9.46), the estimate of each correlation value is obtained as the 
averaged sum of $N$ products. 



9.5 Model-Based Power Spectrum Estimation 

In non-parametric power spectrum estimation, the autocorrelation function 
is assumed to be zero for lags $|m| \ge N$, beyond which no estimates are 
available. In parametric or model-based methods, a model of the signal 
process is used to extrapolate the autocorrelation function beyond the range 
$|m| < N$ for which data is available. Model-based spectral estimators have a 
better resolution than the periodograms, mainly because they do not assume 
that the correlation sequence is zero-valued for the range of lags for which 
no measurements are available. 

In linear model-based spectral estimation, it is assumed that the signal 
x(m ) can be modelled as the output of a linear time-invariant system excited 
with a random, flat-spectrum, excitation. The assumption that the input has 
a flat spectrum implies that the power spectrum of the model output is 
shaped entirely by the frequency response of the model. The input-output 
relation of a generalised discrete linear time-invariant model is given by 

$$x(m) = \sum_{k=1}^{P} a_k\,x(m-k) + \sum_{k=0}^{Q} b_k\,e(m-k) \tag{9.47}$$

where $x(m)$ is the model output, $e(m)$ is the input, and the $a_k$ and $b_k$ are the 
parameters of the model. Equation (9.47) is known as an autoregressive 
moving-average (ARMA) model. The system function $H(z)$ of the discrete 
linear time-invariant model of Equation (9.47) is given by 

$$H(z) = \frac{B(z)}{A(z)} = \frac{\sum_{k=0}^{Q} b_k z^{-k}}{1-\sum_{k=1}^{P} a_k z^{-k}} \tag{9.48}$$

where $1/A(z)$ and $B(z)$ are the autoregressive and moving-average parts of 
$H(z)$ respectively. The power spectrum of the signal $x(m)$ is given as the 
product of the power spectrum of the input signal and the squared 
magnitude frequency response of the model: 

$$P_{XX}(f) = P_{EE}(f)\,|H(f)|^2 \tag{9.49}$$

where $H(f)$ is the frequency response of the model and $P_{EE}(f)$ is the input 
power spectrum. Assuming that the input is a white noise process with unit 
variance, i.e. $P_{EE}(f) = 1$, Equation (9.49) becomes 

$$P_{XX}(f) = |H(f)|^2 \tag{9.50}$$

Thus the power spectrum of the model output is the squared magnitude of 
the frequency response of the model. An important aspect of model-based 
spectral estimation is the choice of the model. The model may be an 
autoregressive (all-pole), a moving-average (all-zero) or an ARMA (pole-zero) 
model. 
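
As a sketch of Equations (9.48) to (9.50), the function below evaluates the model power spectrum from given coefficients, using the sign convention of Equation (9.47). The function name, the normalised frequency variable (cycles per sample) and the example coefficient values are assumptions made for illustration.

```python
import cmath
import math

def arma_power_spectrum(a, b, f, input_variance=1.0):
    """Input variance times |H(f)|^2, with H(z) = B(z)/A(z) as in Equation (9.48).

    a = [a_1, ..., a_P], b = [b_0, ..., b_Q], f in cycles per sample."""
    z_inv = cmath.exp(-2j * math.pi * f)
    num = sum(bk * z_inv ** k for k, bk in enumerate(b))
    den = 1.0 - sum(ak * z_inv ** k for k, ak in enumerate(a, start=1))
    return input_variance * abs(num / den) ** 2

# All-pole example: a pole pair at radius 0.95 and angle 2*pi*0.2 gives a resonance.
r, theta = 0.95, 2.0 * math.pi * 0.2
a_res = [2.0 * r * math.cos(theta), -r * r]
freqs = [i / 1000.0 for i in range(501)]
f_res = max(freqs, key=lambda f: arma_power_spectrum(a_res, [1.0], f))

# All-zero example: b = [1, 1] places a zero at f = 0.5.
print(f_res, arma_power_spectrum([], [1.0, 1.0], 0.5))
```

The all-pole example produces a sharp peak close to the pole angle, whereas the all-zero example produces a null, which illustrates why poles are suited to modelling spectral peaks and zeros to modelling spectral valleys.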

9.5.1 Maximum-Entropy Spectral Estimation 

The power spectrum of a stationary signal is defined as the Fourier 
transform of the autocorrelation sequence: 

$$P_{XX}(f) = \sum_{m=-\infty}^{\infty} r_{xx}(m) e^{-j2\pi fm} \tag{9.51}$$

Equation (9.51) requires the autocorrelation $r_{xx}(m)$ for the lag $m$ in the range 
$\pm\infty$. In practice, an estimate of the autocorrelation $r_{xx}(m)$ is available only 
for the values of $m$ in a finite range of, say, $\pm P$. In general, there are an 
infinite number of different correlation sequences that have the same values 
in the range $|m| \le P$ as the measured values. The particular estimate used 
in the non-parametric methods assumes the correlation values are zero for 
the lags beyond $\pm P$, for which no estimates are available. This arbitrary 
assumption results in spectral leakage and loss of frequency resolution. The 
maximum-entropy estimate is based on the principle that the estimate of the 
autocorrelation sequence must correspond to the most random signal whose 
correlation values in the range $|m| \le P$ coincide with the measured values. 
The maximum-entropy principle is appealing because it assumes no more 
structure in the correlation sequence than that indicated by the measured 
data. The randomness or entropy of a signal is defined as 

$$H[P_{XX}(f)] = \int_{-1/2}^{1/2} \ln P_{XX}(f)\,df \tag{9.52}$$

To obtain the maximum-entropy correlation estimate, we differentiate 
Equation (9.52) with respect to the unknown values of the correlation 
coefficients, and set the derivative to zero: 

$$\frac{\partial H[P_{XX}(f)]}{\partial r_{xx}(m)} = \int_{-1/2}^{1/2} \frac{\partial \ln P_{XX}(f)}{\partial r_{xx}(m)}\,df = 0 \qquad \text{for } |m| > P \tag{9.53}$$

Now, from Equation (9.17), the derivative of the power spectrum with 
respect to the autocorrelation values is given by 

$$\frac{\partial P_{XX}(f)}{\partial r_{xx}(m)} = e^{-j2\pi fm} \tag{9.54}$$

From Equation (9.54), for the derivative of the logarithm of the power 
spectrum, we have 

$$\frac{\partial \ln P_{XX}(f)}{\partial r_{xx}(m)} = P_{XX}^{-1}(f)\,e^{-j2\pi fm} \tag{9.55}$$



Substitution of Equation (9.55) in Equation (9.53) gives 

$$\int_{-1/2}^{1/2} P_{XX}^{-1}(f)\,e^{-j2\pi fm}\,df = 0 \qquad \text{for } |m| > P \tag{9.56}$$

Assuming that $P_{XX}^{-1}(f)$ is integrable, it may be associated with an 
autocorrelation sequence $c(m)$ as 

$$P_{XX}^{-1}(f) = \sum_{m=-\infty}^{\infty} c(m)\,e^{-j2\pi fm} \tag{9.57}$$

where 

$$c(m) = \int_{-1/2}^{1/2} P_{XX}^{-1}(f)\,e^{j2\pi fm}\,df \tag{9.58}$$

From Equations (9.56) and (9.58), we have $c(m)=0$ for $|m| > P$. Hence, from 
Equation (9.57), the inverse of the maximum-entropy power spectrum may 
be obtained from the Fourier transform of a finite-length autocorrelation 
sequence as 

$$P_{XX}^{-1}(f) = \sum_{m=-P}^{P} c(m)\,e^{-j2\pi fm} \tag{9.59}$$

and the maximum-entropy power spectrum is given by 

$$\hat{P}_{XX}^{ME}(f) = \frac{1}{\sum_{m=-P}^{P} c(m)\,e^{-j2\pi fm}} \tag{9.60}$$

Since the denominator polynomial in Equation (9.60) is symmetric, it 
follows that for every zero of this polynomial situated at a radius $r$, there is a 
zero at radius $1/r$. Hence this symmetric polynomial can be factorised and 
expressed as 

$$\sum_{m=-P}^{P} c(m)\,z^{-m} = \frac{1}{\sigma^2}\,A(z)\,A(z^{-1}) \tag{9.61}$$

where $1/\sigma^2$ is a gain term, and $A(z)$ is a polynomial of order $P$ defined as 

$$A(z) = 1 + a_1 z^{-1} + \cdots + a_P z^{-P} \tag{9.62}$$

From Equations (9.60) and (9.61), the maximum-entropy power spectrum 
may be expressed as 

$$\hat{P}_{XX}^{ME}(f) = \left.\frac{\sigma^2}{A(z)\,A(z^{-1})}\right|_{z=e^{j2\pi f}} \tag{9.63}$$



Equation (9.63) shows that the maximum-entropy power spectrum estimate 
is the power spectrum of an autoregressive (AR) model. Equation (9.63) 
was obtained by maximising the entropy of the power spectrum with respect 
to the unknown autocorrelation values. The known values of the 
autocorrelation function can be used to obtain the coefficients of the AR 
model of Equation (9.63), as discussed in the next section. 







9.5.2 Autoregressive Power Spectrum Estimation 

In the preceding section, it was shown that the maximum-entropy spectrum 
is equivalent to the spectrum of an autoregressive model of the signal. An 
autoregressive, or linear prediction model, described in detail in Chapter 8, 
is defined as 



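$$x(m) = \sum_{k=1}^{P} a_k\,x(m-k) + e(m) \tag{9.64}$$

where $e(m)$ is a random signal of variance $\sigma_e^2$. The power spectrum of an 
autoregressive process is given by 

$$P_{XX}^{AR}(f) = \frac{\sigma_e^2}{\left|1-\sum_{k=1}^{P} a_k\,e^{-j2\pi fk}\right|^2} \tag{9.65}$$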



An AR model extrapolates the correlation sequence beyond the range for 
which estimates are available. The relation between the autocorrelation 
values and the AR model parameters is obtained by multiplying both sides 
of Equation (9.64) by x(m-j) and taking the expectation: 

$$\mathcal{E}[x(m)x(m-j)] = \sum_{k=1}^{P} a_k\,\mathcal{E}[x(m-k)x(m-j)] + \mathcal{E}[e(m)x(m-j)] \tag{9.66}$$

Now for the optimal model coefficients the random input $e(m)$ is orthogonal 
to the past samples, and Equation (9.66) becomes 

$$r_{xx}(j) = \sum_{k=1}^{P} a_k\,r_{xx}(j-k), \qquad j = 1, 2, \ldots \tag{9.67}$$

Given $P+1$ correlation values, Equation (9.67) can be solved to obtain the 
AR coefficients $a_k$. Equation (9.67) can also be used to extrapolate the 
correlation sequence. The methods of solving the AR model coefficients are 
discussed in Chapter 8. 
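
As an illustration, Equation (9.67) for $j=1,\ldots,P$ can be solved with the Levinson-Durbin recursion, which is one standard way of solving such Toeplitz equations and is assumed here rather than taken from the text. The sketch also extrapolates the correlation sequence with Equation (9.67) and evaluates the AR spectrum of Equation (9.65). All function names and the example correlation values are illustrative.

```python
import cmath
import math

def levinson_durbin(r, P):
    """Solve r(j) = sum_k a_k r(j-k), j = 1..P (Equation (9.67)).

    r = [r(0), ..., r(P)]. Returns ([a_1, ..., a_P], input variance estimate)."""
    a = []
    err = r[0]
    for i in range(1, P + 1):
        acc = r[i] - sum(a[k] * r[i - 1 - k] for k in range(i - 1))
        k_i = acc / err                               # reflection coefficient
        a = [a[k] - k_i * a[i - 2 - k] for k in range(i - 1)] + [k_i]
        err *= (1.0 - k_i * k_i)                      # prediction error update
    return a, err

def extrapolate(r, a, n_extra):
    """Extend the correlation sequence beyond the known lags with Equation (9.67)."""
    r = list(r)
    for _ in range(n_extra):
        j = len(r)
        r.append(sum(a[k] * r[j - 1 - k] for k in range(len(a))))
    return r

def ar_power_spectrum(a, err, f):
    """Equation (9.65), with f in cycles per sample."""
    den = 1.0 - sum(ak * cmath.exp(-2j * math.pi * f * k)
                    for k, ak in enumerate(a, start=1))
    return err / abs(den) ** 2

# Hypothetical correlation values of a second-order AR process.
a, err = levinson_durbin([1.0, 0.8, 0.46], 2)
print(a, err, extrapolate([1.0, 0.8, 0.46], a, 3))
```

For these correlation values the recursion returns the coefficients $a_1 = 1.2$ and $a_2 = -0.5$, and the extrapolated values continue the correlation sequence instead of truncating it to zero.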

9.5.3 Moving-Average Power Spectrum Estimation 

A moving-average model is also known as an all-zero or a finite impulse 
response (FIR) filter. A signal $x(m)$, modelled as a moving-average process, 
is described as 

$$x(m) = \sum_{k=0}^{Q} b_k\,e(m-k) \tag{9.68}$$



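where $e(m)$ is a zero-mean random input and $Q$ is the model order. The 
cross-correlation of the input and output of a moving-average process is 
given by 

$$\begin{aligned}
r_{xe}(m) &= \mathcal{E}[x(j)\,e(j-m)] \\
&= \mathcal{E}\left[\sum_{k=0}^{Q} b_k\,e(j-k)\,e(j-m)\right] = \sigma_e^2\,b_m
\end{aligned} \tag{9.69}$$

and the autocorrelation function of a moving-average process is 

$$r_{xx}(m) = \begin{cases} \sigma_e^2 \displaystyle\sum_{k=0}^{Q-|m|} b_k\,b_{k+|m|}, & |m| \le Q \\ 0, & |m| > Q \end{cases} \tag{9.70}$$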

From Equation (9.70), the power spectrum obtained from the Fourier 
transform of the autocorrelation sequence is the same as the power spectrum 
of a moving average model of the signal. Hence the power spectrum of a 
moving-average process may be obtained directly from the Fourier 
transform of the autocorrelation function as 



$$P_{XX}^{MA}(f) = \sum_{m=-Q}^{Q} r_{xx}(m)\,e^{-j2\pi fm} \tag{9.71}$$

Note that the moving-average spectral estimation is identical to the 
Blackman-Tukey method of estimating periodograms from the 
autocorrelation sequence. 
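
The equivalence between Equation (9.71) and the squared magnitude response of the moving-average filter can be checked numerically. The sketch below uses assumed coefficient values and function names and takes the input to be white with variance $\sigma_e^2$.

```python
import cmath
import math

def ma_autocorrelation(b, m, input_variance=1.0):
    """Equation (9.70) for coefficients b = [b_0, ..., b_Q]."""
    Q, m = len(b) - 1, abs(m)
    if m > Q:
        return 0.0
    return input_variance * sum(b[k] * b[k + m] for k in range(Q - m + 1))

def ma_power_spectrum(b, f, input_variance=1.0):
    """Equation (9.71); r_xx(m) is even in m, so the sum is a cosine series."""
    Q = len(b) - 1
    return ma_autocorrelation(b, 0, input_variance) + 2.0 * sum(
        ma_autocorrelation(b, m, input_variance) * math.cos(2.0 * math.pi * f * m)
        for m in range(1, Q + 1))

def squared_magnitude_response(b, f):
    """|B(f)|^2 evaluated directly from the filter coefficients."""
    return abs(sum(bk * cmath.exp(-2j * math.pi * f * k) for k, bk in enumerate(b))) ** 2

b = [1.0, 0.5, -0.25]      # hypothetical second-order MA coefficients
for f in (0.0, 0.1, 0.25, 0.5):
    print(f, ma_power_spectrum(b, f), squared_magnitude_response(b, f))
```

The two columns agree at every frequency, since the autocorrelation of a moving-average process is zero beyond lag $Q$ and the finite sum of Equation (9.71) is therefore exact.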

9.5.4 Autoregressive Moving-Average Power Spectrum 
Estimation 

The ARMA, or pole-zero, model is described by Equation (9.47). The 
relationship between the ARMA parameters and the autocorrelation 
sequence can be obtained by multiplying both sides of Equation (9.47) by 
$x(m-j)$ and taking the expectation: 

$$r_{xx}(j) = \sum_{k=1}^{P} a_k\,r_{xx}(j-k) + \sum_{k=0}^{Q} b_k\,r_{xe}(j-k) \tag{9.72}$$

The moving-average part of Equation (9.72) influences the autocorrelation 
values only up to the lag of $Q$. Hence, for the autoregressive part of 
Equation (9.72), we have 

$$r_{xx}(m) = \sum_{k=1}^{P} a_k\,r_{xx}(m-k) \qquad \text{for } m > Q \tag{9.73}$$

Hence Equation (9.73) can be used to obtain the coefficients $a_k$, which may 
then be substituted in Equation (9.72) for solving the coefficients $b_k$. Once 
the coefficients of an ARMA model are identified, the spectral estimate is 
given by 

$$P_{XX}^{ARMA}(f) = \sigma_e^2\,\frac{\left|\sum_{k=0}^{Q} b_k\,e^{-j2\pi fk}\right|^2}{\left|1-\sum_{k=1}^{P} a_k\,e^{-j2\pi fk}\right|^2} \tag{9.74}$$

where $\sigma_e^2$ is the variance of the input of the ARMA model. In general, the 
poles model the resonances of the signal spectrum, whereas the zeros model 
the anti-resonances of the spectrum. 



9.6 High-Resolution Spectral Estimation Based on Subspace 
Eigen-Analysis 

The eigen-based methods considered in this section are primarily used for 
estimation of the parameters of sinusoidal signals observed in an additive 
white noise. Eigen-analysis is used for partitioning the eigenvectors and the 
eigenvalues of the autocorrelation matrix of a noisy signal into two 
subspaces: 

(a) the signal subspace composed of the principal eigenvectors 
associated with the largest eigenvalues; 

(b) the noise subspace represented by the smallest eigenvalues. 

The decomposition of a noisy signal into a signal subspace and a noise 
subspace forms the basis of the eigen-analysis methods considered in this 
section. 

9.6.1 Pisarenko Harmonic Decomposition 

A real-valued sine wave can be modelled by a second-order autoregressive 
(AR) model, with its poles on the unit circle at the angular frequency of the 
sinusoid as shown in Figure 9.5. The AR model for a sinusoid of frequency 
$F_i$ at a sampling rate of $F_s$ is given by 

$$x(m) = 2\cos(2\pi F_i/F_s)\,x(m-1) - x(m-2) + A\,\delta(m-t_0) \tag{9.75}$$

where $A\,\delta(m-t_0)$ is the initial impulse for a sine wave of amplitude $A$. In 
general, a signal composed of $P$ real sinusoids can be modelled by an AR 
model of order $2P$ as 

$$x(m) = \sum_{k=1}^{2P} a_k\,x(m-k) + A\,\delta(m-t_0) \tag{9.76}$$

Figure 9.5 A second-order all-pole model of a sinusoidal signal. 

The transfer function of the AR model is given by 

$$H(z) = \frac{A}{1-\sum_{k=1}^{2P} a_k z^{-k}} = \frac{A}{\prod_{k=1}^{P}\left(1-e^{-j2\pi F_k}z^{-1}\right)\left(1-e^{+j2\pi F_k}z^{-1}\right)} \tag{9.77}$$

where the angular positions of the poles on the unit circle, $e^{\pm j2\pi F_k}$, 
correspond to the angular frequencies of the sinusoids. For $P$ real sinusoids 
observed in an additive white noise, we can write 



$$\begin{aligned}
y(m) &= x(m) + n(m) \\
&= \sum_{k=1}^{2P} a_k\,x(m-k) + n(m)
\end{aligned} \tag{9.78}$$

Substituting $[y(m-k)-n(m-k)]$ for $x(m-k)$ in Equation (9.78) yields 

$$y(m) - \sum_{k=1}^{2P} a_k\,y(m-k) = n(m) - \sum_{k=1}^{2P} a_k\,n(m-k) \tag{9.79}$$

From Equation (9.79), the noisy sinusoidal signal $y(m)$ can be modelled by 
an ARMA process in which the AR and the MA sections are identical, and 
the input is the noise process. Equation (9.79) can also be expressed in a 
vector notation as 

$$\mathbf{y}^{\mathrm T}\mathbf{a} = \mathbf{n}^{\mathrm T}\mathbf{a} \tag{9.80}$$

where $\mathbf{y}^{\mathrm T} = [y(m), \ldots, y(m-2P)]$, $\mathbf{a}^{\mathrm T} = [1, a_1, \ldots, a_{2P}]$ (with the minus 
signs of Equation (9.79) absorbed into the coefficients) and 
$\mathbf{n}^{\mathrm T} = [n(m), \ldots, n(m-2P)]$. To obtain the parameter vector $\mathbf{a}$, we multiply 
both sides of Equation (9.80) by the vector $\mathbf{y}$ and take the expectation: 

( E\_yy J ]a=‘E[yn i: ]a (9.81) 

or 

R yy a = R yn a (9.82) 



where £[yy T ]=/? 3 , > , , and ‘E[yn l ]=R yil can be written as 




High-Resolution Spectral Estimation 



287 



R yn =‘E[(x + n)n T ] 

= ‘E[nn T ]=R tin =o 2 I 



(9.83) 



where $\sigma_n^2$ is the noise variance. Using Equation (9.83), Equation (9.82)
becomes

$$\mathbf{R}_{yy}\,\mathbf{a} = \sigma_n^2\,\mathbf{a} \tag{9.84}$$

Equation (9.84) is in the form of an eigenequation. If the dimension of the
matrix $\mathbf{R}_{yy}$ is greater than $2P \times 2P$ then the largest $2P$ eigenvalues are
associated with the eigenvectors of the noisy sinusoids and the minimum
eigenvalue corresponds to the noise variance $\sigma_n^2$. The parameter vector $\mathbf{a}$ is
obtained as the eigenvector of $\mathbf{R}_{yy}$ that is associated with the minimum
eigenvalue, scaled so that its first element is unity. From the AR parameter
vector $\mathbf{a}$, we can obtain the frequencies of the sinusoids by first
calculating the roots of the polynomial

$$1 + a_1 z^{-1} + a_2 z^{-2} + \cdots + a_2 z^{-2P+2} + a_1 z^{-2P+1} + z^{-2P} = 0 \tag{9.85}$$

Note that for sinusoids, the AR parameters form a symmetric polynomial;
that is $a_k = a_{2P-k}$. The frequencies $F_k$ of the sinusoids can be obtained from
the roots $z_k$ of Equation (9.85) using the relation

$$z_k = e^{j2\pi F_k} \tag{9.86}$$

The powers of the sinusoids are calculated as follows. For $P$ sinusoids
observed in additive white noise, the autocorrelation function is given by

$$r_{yy}(k) = \sum_{i=1}^{P} P_i \cos(2\pi k F_i) + \sigma_n^2\,\delta(k) \tag{9.87}$$

where $P_i = A_i^2/2$ is the power of the sinusoid $A_i \sin(2\pi F_i m)$, and white noise
affects only the correlation at lag zero, $r_{yy}(0)$. Hence Equation (9.87) for the
correlation lags $k = 1, \ldots, P$ can be written as

$$\begin{bmatrix}
\cos 2\pi F_1 & \cos 2\pi F_2 & \cdots & \cos 2\pi F_P \\
\cos 4\pi F_1 & \cos 4\pi F_2 & \cdots & \cos 4\pi F_P \\
\vdots & \vdots & \ddots & \vdots \\
\cos 2P\pi F_1 & \cos 2P\pi F_2 & \cdots & \cos 2P\pi F_P
\end{bmatrix}
\begin{bmatrix} P_1 \\ P_2 \\ \vdots \\ P_P \end{bmatrix}
=
\begin{bmatrix} r_{yy}(1) \\ r_{yy}(2) \\ \vdots \\ r_{yy}(P) \end{bmatrix} \tag{9.88}$$



Given an estimate of the frequencies $F_i$ from Equations (9.85) and (9.86), and
an estimate of the autocorrelation function $\hat{r}_{yy}(k)$, Equation (9.88) can be
solved to obtain the powers of the sinusoids $P_i$. The noise variance can then
be obtained from Equation (9.87) as

$$\sigma_n^2 = r_{yy}(0) - \sum_{i=1}^{P} P_i \tag{9.89}$$
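The eigen-analysis steps of Equations (9.84) to (9.89) can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions: the function name `pisarenko` is hypothetical, the autocorrelation lags `r[0..2P]` are taken as given (exact or estimated), frequencies are normalised to the sampling rate, and the correlation matrix is chosen to be of size $(2P+1) \times (2P+1)$ so that the minimum eigenvalue is unique.

```python
import numpy as np

def pisarenko(r, P):
    """Estimate P real sinusoids from the autocorrelation lags r[0..2P]."""
    r = np.asarray(r, dtype=float)
    n = 2 * P + 1
    # Symmetric Toeplitz autocorrelation matrix R_yy of size (2P+1) x (2P+1)
    idx = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :])
    R = r[idx]
    eigvals, eigvecs = np.linalg.eigh(R)       # eigenvalues in ascending order
    a = eigvecs[:, 0] / eigvecs[0, 0]          # minimum-eigenvalue vector, first element 1
    roots = np.roots(a)                        # roots of the polynomial in Equation (9.85)
    freqs = np.sort(np.angle(roots) / (2 * np.pi))
    freqs = freqs[freqs > 0][:P]               # keep one root of each conjugate pair
    # Equation (9.88): solve for the sinusoid powers from lags 1..P
    k = np.arange(1, P + 1)[:, None]
    C = np.cos(2 * np.pi * k * freqs[None, :])
    powers = np.linalg.solve(C, r[1:P + 1])
    noise_var = r[0] - powers.sum()            # Equation (9.89)
    return freqs, powers, noise_var
```

With estimated rather than exact correlation lags the roots generally move slightly off the unit circle, so in practice only their angles are used.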

9.6.2 Multiple Signal Classification (MUSIC) Spectral Estimation 

The MUSIC algorithm is an eigen-based subspace decomposition method 
for estimation of the frequencies of complex sinusoids observed in additive 
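white noise. Consider a signal $y(m)$ modelled as

$$y(m) = \sum_{k=1}^{P} A_k\,e^{j(2\pi F_k m + \phi_k)} + n(m) \tag{9.90}$$

An $N$-sample vector $\mathbf{y}=[y(m), \ldots, y(m+N-1)]$ of the noisy signal can be
written as

$$\mathbf{y} = \mathbf{x} + \mathbf{n} = \mathbf{S}\mathbf{a} + \mathbf{n} \tag{9.91}$$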



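where the signal vector $\mathbf{x}=\mathbf{S}\mathbf{a}$ is defined as

$$\begin{bmatrix} x(m) \\ x(m+1) \\ \vdots \\ x(m+N-1) \end{bmatrix}
=
\begin{bmatrix}
e^{j2\pi F_1 m} & e^{j2\pi F_2 m} & \cdots & e^{j2\pi F_P m} \\
e^{j2\pi F_1 (m+1)} & e^{j2\pi F_2 (m+1)} & \cdots & e^{j2\pi F_P (m+1)} \\
\vdots & \vdots & \ddots & \vdots \\
e^{j2\pi F_1 (m+N-1)} & e^{j2\pi F_2 (m+N-1)} & \cdots & e^{j2\pi F_P (m+N-1)}
\end{bmatrix}
\begin{bmatrix} A_1 e^{j\phi_1} \\ A_2 e^{j\phi_2} \\ \vdots \\ A_P e^{j\phi_P} \end{bmatrix} \tag{9.92}$$

The matrix $\mathbf{S}$ and the vector $\mathbf{a}$ are defined on the right-hand side of Equation
(9.92). The autocorrelation matrix of the noisy signal $\mathbf{y}$ can be written as the
sum of the autocorrelation matrices of the signal $\mathbf{x}$ and the noise as

$$\mathbf{R}_{yy} = \mathbf{R}_{xx} + \mathbf{R}_{nn} = \mathbf{S}\mathbf{P}\mathbf{S}^{\rm H} + \sigma_n^2\,\mathbf{I} \tag{9.93}$$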



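where $\mathbf{R}_{xx}=\mathbf{S}\mathbf{P}\mathbf{S}^{\rm H}$ and $\mathbf{R}_{nn}=\sigma_n^2\,\mathbf{I}$ are the autocorrelation matrices of the
signal and noise processes, the exponent H denotes the Hermitian transpose,
and the diagonal matrix $\mathbf{P}$ defines the power of the sinusoids as

$$\mathbf{P} = \mathbf{a}\mathbf{a}^{\rm H} = {\rm diag}[P_1, P_2, \ldots, P_P] \tag{9.94}$$

where $P_i = A_i^2$ is the power of the complex sinusoid $e^{j2\pi F_i m}$. The
correlation matrix of the signal can also be expressed in the form

$$\mathbf{R}_{xx} = \sum_{k=1}^{P} P_k\,\mathbf{s}_k \mathbf{s}_k^{\rm H} \tag{9.95}$$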



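where $\mathbf{s}_k = [1, e^{j2\pi F_k}, \ldots, e^{j2\pi (N-1) F_k}]^{\rm T}$. Now consider an eigen-decomposition
of the $N \times N$ correlation matrix $\mathbf{R}_{xx}$

$$\mathbf{R}_{xx} = \sum_{k=1}^{N} \lambda_k \mathbf{v}_k \mathbf{v}_k^{\rm H} = \sum_{k=1}^{P} \lambda_k \mathbf{v}_k \mathbf{v}_k^{\rm H} \tag{9.96}$$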



where $\lambda_k$ and $\mathbf{v}_k$ are the eigenvalues and eigenvectors of the matrix $\mathbf{R}_{xx}$
respectively. We have also used the fact that the autocorrelation matrix $\mathbf{R}_{xx}$
of $P$ complex sinusoids has only $P$ non-zero eigenvalues, so that
$\lambda_{P+1}=\lambda_{P+2}=\cdots=\lambda_N=0$. Since the sum of the cross-products of the
eigenvectors forms an identity matrix, we can also express the diagonal
autocorrelation matrix of the noise in terms of the eigenvectors of $\mathbf{R}_{xx}$ as

$$\mathbf{R}_{nn} = \sigma_n^2\,\mathbf{I} = \sigma_n^2 \sum_{k=1}^{N} \mathbf{v}_k \mathbf{v}_k^{\rm H} \tag{9.97}$$



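The correlation matrix of the noisy signal may be expressed in terms of its
eigenvectors and the associated eigenvalues as

$$\mathbf{R}_{yy} = \sum_{k=1}^{P} \lambda_k \mathbf{v}_k \mathbf{v}_k^{\rm H} + \sigma_n^2 \sum_{k=1}^{N} \mathbf{v}_k \mathbf{v}_k^{\rm H}
= \sum_{k=1}^{P} (\lambda_k + \sigma_n^2)\,\mathbf{v}_k \mathbf{v}_k^{\rm H} + \sigma_n^2 \sum_{k=P+1}^{N} \mathbf{v}_k \mathbf{v}_k^{\rm H} \tag{9.98}$$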



From Equation (9.98), the eigenvectors and the eigenvalues of the
correlation matrix of the noisy signal can be partitioned into two disjoint
subsets (see Figure 9.6). The set of eigenvectors $\{\mathbf{v}_1, \ldots, \mathbf{v}_P\}$, associated
with the $P$ largest eigenvalues, span the signal subspace and are called the
principal eigenvectors. The signal vectors $\mathbf{s}_i$ can be expressed as linear
combinations of the principal eigenvectors. The second subset of
eigenvectors $\{\mathbf{v}_{P+1}, \ldots, \mathbf{v}_N\}$ span the noise subspace and have $\sigma_n^2$ as their
eigenvalues. Since the signal and noise eigenvectors are orthogonal, it
follows that the signal subspace and the noise subspace are orthogonal.
Hence the sinusoidal signal vectors $\mathbf{s}_i$, which are in the signal subspace, are
orthogonal to the noise subspace, and we have
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Figure 9.6 Decomposition of the eigenvalues of a noisy signal into the principal 

eigenvalues and the noise eigenvalues.

$$\mathbf{s}_i^{\rm H}\mathbf{v}_k = \sum_{m=0}^{N-1} v_k(m)\,e^{-j2\pi F_i m} = 0, \qquad i = 1, \ldots, P; \quad k = P+1, \ldots, N \tag{9.99}$$



Equation (9.99) implies that the frequencies of the $P$ sinusoids can be
obtained by solving for the zeros of the following polynomial function of
the frequency variable $f$:

$$\sum_{k=P+1}^{N} \mathbf{s}^{\rm H}(f)\,\mathbf{v}_k = 0 \tag{9.100}$$



In the MUSIC algorithm, the power spectrum estimate is defined as

$$P_{XX}(f) = \sum_{k=P+1}^{N} \left| \mathbf{s}^{\rm H}(f)\,\mathbf{v}_k \right|^2 \tag{9.101}$$



where $\mathbf{s}(f) = [1, e^{j2\pi f}, \ldots, e^{j2\pi (N-1) f}]^{\rm T}$ is the complex sinusoidal vector, and
$\{\mathbf{v}_{P+1}, \ldots, \mathbf{v}_N\}$ are the eigenvectors in the noise subspace. From Equations
(9.99) and (9.101) we have that

$$P_{XX}(F_i) = 0, \qquad i = 1, \ldots, P \tag{9.102}$$



Since $P_{XX}(f)$ has its zeros at the frequencies of the sinusoids, it follows that
the reciprocal of $P_{XX}(f)$ has its poles at these frequencies. The MUSIC
spectrum is defined as

$$P_{XX}^{\rm MUSIC}(f) = \frac{1}{\sum_{k=P+1}^{N} \left| \mathbf{s}^{\rm H}(f)\,\mathbf{v}_k \right|^2} = \frac{1}{\mathbf{s}^{\rm H}(f)\,\mathbf{V}\mathbf{V}^{\rm H}\,\mathbf{s}(f)} \tag{9.103}$$



where $\mathbf{V}=[\mathbf{v}_{P+1}, \ldots, \mathbf{v}_N]$ is the matrix of eigenvectors of the noise subspace.
$P_{XX}^{\rm MUSIC}(f)$ is sharply peaked at the frequencies of the sinusoidal components
of the signal, and hence the frequencies of its peaks are taken as the MUSIC
estimates.
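As an illustration of Equation (9.103), the following NumPy sketch evaluates the MUSIC spectrum on a grid of frequencies. It assumes that the $N \times N$ correlation matrix `R` of the noisy signal and the number of sinusoids `P` are already known, and that frequencies are normalised to the sampling rate; the function name `music_spectrum` is hypothetical.

```python
import numpy as np

def music_spectrum(R, P, freqs):
    """MUSIC spectrum of Equation (9.103) on a grid of normalised frequencies."""
    N = R.shape[0]
    eigvals, eigvecs = np.linalg.eigh(R)           # eigenvalues in ascending order
    V = eigvecs[:, :N - P]                         # the N-P noise-subspace eigenvectors
    m = np.arange(N)[:, None]
    S = np.exp(2j * np.pi * m * np.asarray(freqs)[None, :])   # columns are s(f)
    proj = np.sum(np.abs(V.conj().T @ S) ** 2, axis=0)        # s^H(f) V V^H s(f)
    return 1.0 / proj
```

The frequency estimates are then read off as the locations of the $P$ largest local maxima of the returned array. The grid spacing limits the precision of these estimates, so a finer grid or a local refinement around each peak may be needed.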
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9.6.3 Estimation of Signal Parameters via Rotational Invariance 
Techniques (ESPRIT) 

The ESPRIT algorithm is an eigen-decomposition approach for estimating
the frequencies of a number of complex sinusoids observed in additive white
noise. Consider a signal $y(m)$ composed of $P$ complex-valued sinusoids and
additive white noise:

$$y(m) = \sum_{k=1}^{P} A_k\,e^{j(2\pi F_k m + \phi_k)} + n(m) \tag{9.104}$$



The ESPRIT algorithm exploits the deterministic relation between the
sinusoidal component of the signal vector $\mathbf{y}(m)=[y(m), \ldots, y(m+N-1)]^{\rm T}$ and
that of the time-shifted vector $\mathbf{y}(m+1)=[y(m+1), \ldots, y(m+N)]^{\rm T}$. The signal
component of the noisy vector $\mathbf{y}(m)$ may be expressed as

$$\mathbf{x}(m) = \mathbf{S}\mathbf{a} \tag{9.105}$$

where $\mathbf{S}$ is the complex sinusoidal matrix and $\mathbf{a}$ is the vector containing the
amplitude and phase of the sinusoids as in Equations (9.91) and (9.92). A
complex sinusoid $e^{j2\pi F_i m}$ can be time-shifted by one sample through
multiplication by a phase term $e^{j2\pi F_i}$. Hence the time-shifted sinusoidal
signal vector $\mathbf{x}(m+1)$ may be obtained from $\mathbf{x}(m)$ by phase-shifting each
complex sinusoidal component of $\mathbf{x}(m)$ as

$$\mathbf{x}(m+1) = \mathbf{S}\boldsymbol{\Phi}\mathbf{a} \tag{9.106}$$

where $\boldsymbol{\Phi}$ is a $P \times P$ phase matrix defined as

$$\boldsymbol{\Phi} = {\rm diag}[e^{j2\pi F_1}, e^{j2\pi F_2}, \ldots, e^{j2\pi F_P}] \tag{9.107}$$



The diagonal elements of $\boldsymbol{\Phi}$ are the relative phases between the adjacent
samples of the sinusoids. The matrix $\boldsymbol{\Phi}$ is a unitary matrix and is known as
a rotation matrix since it relates the time-shifted vectors $\mathbf{x}(m)$ and $\mathbf{x}(m+1)$.
The autocorrelation matrix of the noisy signal vector $\mathbf{y}(m)$ can be written as

$$\mathbf{R}_{y(m)y(m)} = \mathbf{S}\mathbf{P}\mathbf{S}^{\rm H} + \sigma_n^2\,\mathbf{I} \tag{9.108}$$

where the matrix $\mathbf{P}$ is diagonal, and its diagonal elements are the powers of
the complex sinusoids, $\mathbf{P}={\rm diag}[A_1^2, \ldots, A_P^2] = \mathbf{a}\mathbf{a}^{\rm H}$. The cross-covariance
matrix of the vectors $\mathbf{y}(m)$ and $\mathbf{y}(m+1)$ is

$$\mathbf{R}_{y(m)y(m+1)} = \mathbf{S}\mathbf{P}\boldsymbol{\Phi}^{\rm H}\mathbf{S}^{\rm H} + \mathbf{R}_{n(m)n(m+1)} \tag{9.109}$$



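where the covariance matrices $\mathbf{R}_{y(m)y(m+1)}$ and $\mathbf{R}_{n(m)n(m+1)}$ are defined as

$$\mathbf{R}_{y(m)y(m+1)} = E[\mathbf{y}(m)\,\mathbf{y}^{\rm H}(m+1)] \tag{9.110}$$

$$\mathbf{R}_{n(m)n(m+1)} = E[\mathbf{n}(m)\,\mathbf{n}^{\rm H}(m+1)] \tag{9.111}$$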



The correlation matrix of the signal vector $\mathbf{x}(m)$ can be estimated as

$$\mathbf{R}_{x(m)x(m)} = \mathbf{R}_{y(m)y(m)} - \mathbf{R}_{n(m)n(m)} = \mathbf{S}\mathbf{P}\mathbf{S}^{\rm H} \tag{9.112}$$



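and the cross-correlation matrix of the signal vector $\mathbf{x}(m)$ with its
time-shifted version $\mathbf{x}(m+1)$ is obtained as

$$\mathbf{R}_{x(m)x(m+1)} = \mathbf{R}_{y(m)y(m+1)} - \mathbf{R}_{n(m)n(m+1)} = \mathbf{S}\mathbf{P}\boldsymbol{\Phi}^{\rm H}\mathbf{S}^{\rm H} \tag{9.113}$$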



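Subtraction of a fraction $\lambda_i = e^{j2\pi F_i}$ of Equation (9.113) from Equation
(9.112) yields

$$\mathbf{R}_{x(m)x(m)} - \lambda_i\,\mathbf{R}_{x(m)x(m+1)} = \mathbf{S}\mathbf{P}(\mathbf{I} - \lambda_i \boldsymbol{\Phi}^{\rm H})\mathbf{S}^{\rm H} \tag{9.114}$$

With $\boldsymbol{\Phi}$ as defined in Equation (9.107), the $i$th diagonal element of
$\mathbf{I} - \lambda_i \boldsymbol{\Phi}^{\rm H}$ is zero for this value of $\lambda_i$, so the matrix in Equation (9.114)
loses rank. From Equations (9.107) and (9.114), the frequencies of the
sinusoids can therefore be estimated as the roots of Equation (9.114), i.e.
the values of $\lambda$ at which the matrix becomes rank-deficient.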
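One way to find the rank-reducing values of $\lambda$ numerically is sketched below. It is not taken from the text: it assumes the noise-free matrices of Equations (9.112) and (9.113) are available as `Rxx` and `Rxx1`, that $\boldsymbol{\Phi} = {\rm diag}[e^{j2\pi F_k}]$ as in Equation (9.107), and it uses a rank-$P$ pseudo-inverse, which is one of several possible ways of solving the resulting generalised eigenvalue problem. The function name `esprit_frequencies` is hypothetical.

```python
import numpy as np

def esprit_frequencies(Rxx, Rxx1, P):
    """Normalised frequencies from the pencil Rxx - lambda * Rxx1 of Equation (9.114)."""
    # Rxx1 = S P Phi^H S^H has rank P, so invert it only on its rank-P subspace
    U, s, Vh = np.linalg.svd(Rxx1)
    pinv = (Vh[:P].conj().T / s[:P]) @ U[:, :P].conj().T
    lam = np.linalg.eigvals(pinv @ Rxx)
    lam = lam[np.argsort(-np.abs(lam))[:P]]        # the P non-zero eigenvalues
    return np.sort(np.angle(lam) / (2 * np.pi))    # lambda_i = exp(j 2 pi F_i)
```

With estimated correlation matrices the noise terms are only approximately removed, so the recovered eigenvalues lie near, rather than exactly on, the unit circle.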



9.7 Summary 

Power spectrum estimation is perhaps the most widely used method of 
signal analysis. The main objective of any transformation is to express a 
signal in a form that lends itself to more convenient analysis and 
manipulation. The power spectrum is related to the correlation function 
through the Fourier transform. The power spectrum reveals the repetitive 
and correlated patterns of a signal, which are important in detection, 
estimation, data forecasting and decision-making systems. We began this 
chapter with Section 9.1 on basic definitions of the Fourier series/transform, 
energy spectrum and power spectrum. In Section 9.2, we considered non- 
parametric DFT-based methods of spectral analysis. These methods do not 
offer the high resolution of parametric and eigen-based methods. However, 
they are attractive in that they are computationally less expensive than 
model-based methods and are relatively robust. In Section 9.3, we 
considered the maximum-entropy and the model-based spectral estimation 
methods. These methods can extrapolate the correlation values beyond the
range for which data is available, and hence can offer higher resolution and
fewer side-lobes. In Section 9.6, we considered the eigen-based spectral
estimation of noisy signals. These methods decompose the eigenvectors of
the correlation matrix of the noisy signal into a signal subspace and a
noise subspace. The
orthogonality of the signal and noise subspaces is used to estimate the signal 
and noise parameters. In the next chapter, we use DFT-based spectral 
estimation for restoration of signals observed in noise. 
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10.1 Introduction 

10.2 Polynomial Interpolation 

10.3 Model-Based Interpolation 

10.4 Summary 

Interpolation is the estimation of the unknown, or the lost, samples of a 
signal using a weighted average of a number of known samples at the 
neighbourhood points. Interpolators are used in various forms in most 
signal processing and decision making systems. Applications of 
interpolators include conversion of a discrete-time signal to a continuous- 
time signal, sampling rate conversion in multirate communication systems, 
low-bit-rate speech coding, up-sampling of a signal for improved graphical 
representation, and restoration of a sequence of samples irrevocably 
distorted by transmission errors, impulsive noise, dropouts, etc. This 
chapter begins with a study of the basic concept of ideal interpolation of a 
band-limited signal, a simple model for the effects of a number of missing 
samples, and the factors that affect the interpolation process. The classical 
approach to interpolation is to construct a polynomial that passes through 
the known samples. In Section 10.2, a general form of polynomial 
interpolation and its special forms, Lagrange, Newton, Hermite and cubic 
spline interpolators, are considered. Optimal interpolators utilise predictive 
and statistical models of the signal process. In Section 10.3, a number of 
model-based interpolation methods are considered. These methods include 
maximum a posteriori interpolation, and least square error interpolation 
based on an autoregressive model. Finally, we consider time-frequency 
interpolation, and interpolation through searching an adaptive signal 
codebook for the best-matching signal. 
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10.1 Introduction 

The objective of interpolation is to obtain a high-fidelity reconstruction of 
the unknown or the missing samples of a signal. The emphasis in this 
chapter is on the interpolation of a sequence of lost samples. However, first 
in this section, the theory of ideal interpolation of a band-limited signal is 
introduced, and its applications in conversion of a discrete-time signal to a 
continuous-time signal and in conversion of the sampling rate of a digital 
signal are considered. Then a simple distortion model is used to gain insight 
on the effects of a sequence of lost samples and on the methods of recovery 
of the lost samples. The factors that affect interpolation error are also 
considered in this section. 

10.1.1 Interpolation of a Sampled Signal 

A common application of interpolation is the reconstruction of a 
continuous-time signal x(t) from a discrete-time signal x(m). The condition 
for the recovery of a continuous-time signal from its samples is given by the 
Nyquist sampling theorem. The Nyquist theorem states that a band-limited
signal, with a highest frequency content of $F_c$ (Hz), can be reconstructed
from its samples if the sampling rate is greater than $2F_c$ samples per
second. Consider a band-limited continuous-time signal $x(t)$, sampled at a
rate of $F_s$ samples per second. The discrete-time signal $x(m)$ may be

expressed as the following product: 



Figure 10.1 Reconstruction of a continuous-time signal from its samples
(time-domain and frequency-domain panels). In the frequency domain,
interpolation is equivalent to low-pass filtering.

Figure 10.2 Illustration of up-sampling by a factor of 3 using a two-stage process
of zero-insertion and digital low-pass filtering (panels: original signal,
zero-inserted signal, interpolated signal).

$$x(m) = x(t)\,p(t) = \sum_{m=-\infty}^{\infty} x(t)\,\delta(t - mT_s) \tag{10.1}$$



where $p(t)=\sum_m \delta(t-mT_s)$ is the sampling function and $T_s = 1/F_s$ is the sampling
interval. Taking the Fourier transform of Equation (10.1), it can be shown
that the spectrum of the sampled signal is given by

$$X_s(f) = X(f) * P(f) = \sum_{k=-\infty}^{\infty} X(f + kF_s) \tag{10.2}$$

where $X(f)$ and $P(f)$ are the spectra of the signal $x(t)$ and the sampling
function $p(t)$ respectively, and $*$ denotes the convolution operation.
Equation (10.2), illustrated in Figure 10.1, states that the spectrum of a
sampled signal is composed of the original base-band spectrum $X(f)$ and the
repetitions or images of $X(f)$ spaced uniformly at frequency intervals of
$F_s=1/T_s$. When the sampling frequency is above the Nyquist rate, the
base-band spectrum $X(f)$ is not overlapped by its images $X(f \pm kF_s)$, and the
original signal can be recovered by a low-pass filter as shown in Figure
10.1. Hence the ideal interpolator of a band-limited discrete-time signal is
an ideal low-pass filter with a sinc impulse response. The recovery of a
continuous-time signal through sinc interpolation can be expressed as

$$x(t) = \sum_{m=-\infty}^{\infty} x(m)\,T_s f_c\,{\rm sinc}[\pi f_c (t - mT_s)] \tag{10.3}$$
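A truncated version of this interpolation sum can be sketched as follows. The sketch assumes an ideal low-pass filter with cutoff at half the sampling rate, for which the reconstruction reduces to $x(t) = \sum_m x(m)\,{\rm sinc}[(t - mT_s)/T_s]$ with the normalised sinc $\sin(\pi u)/(\pi u)$, and it uses only the finite set of available samples. The function name `sinc_interpolate` is hypothetical.

```python
import numpy as np

def sinc_interpolate(x, Ts, t):
    """Evaluate sum_m x(m) sinc((t - m*Ts)/Ts) at the times in the array t."""
    m = np.arange(len(x))
    # np.sinc(u) is the normalised sinc, sin(pi*u)/(pi*u)
    kernel = np.sinc((np.asarray(t)[:, None] - m[None, :] * Ts) / Ts)
    return kernel @ np.asarray(x)
```

Because the sinc kernel decays only as $1/|t|$, truncating the sum to a finite record introduces an error that is largest near the ends of the record.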



In practice, the sampling rate $F_s$ should be sufficiently greater than $2F_c$, say
$2.5F_c$, in order to accommodate the transition bandwidth of the
interpolating low-pass filter.

10.1.2 Digital Interpolation by a Factor of I

Applications of digital interpolators include sampling rate conversion in
multirate communication systems and up-sampling for improved graphical
representation. To change a sampling rate by a factor of $V=I/D$ (where $I$ and
$D$ are integers), the signal is first interpolated by a factor of $I$, and then the
interpolated signal is decimated by a factor of $D$.

Consider a band-limited discrete-time signal $x(m)$ with a base-band
spectrum $X(f)$ as shown in Figure 10.2. The sampling rate can be increased
by a factor of $I$ through interpolation of $I-1$ samples between every two
samples of $x(m)$. In the following it is shown that digital interpolation by a
factor of $I$ can be achieved through a two-stage process of (a) insertion of
$I-1$ zeros in between every two samples and (b) low-pass filtering of the
zero-inserted signal by a filter with a cutoff frequency of $F_s/2I$, where $F_s$ is the
sampling rate. Consider the zero-inserted signal $x_z(m)$ obtained by inserting
$I-1$ zeros between every two samples of $x(m)$ and expressed as

$$x_z(m) = \begin{cases} x\!\left(\dfrac{m}{I}\right), & m = 0, \pm I, \pm 2I, \ldots \\[4pt] 0, & \text{otherwise} \end{cases} \tag{10.4}$$



The spectrum of the zero-inserted signal is related to the spectrum of the
original discrete-time signal by

$$X_z(f) = \sum_{m=-\infty}^{\infty} x_z(m)\,e^{-j2\pi f m} = \sum_{m=-\infty}^{\infty} x(m)\,e^{-j2\pi f m I} = X(If) \tag{10.5}$$

Equation (10.5) states that the spectrum of the zero-inserted signal $X_z(f)$ is a
frequency-scaled version of the spectrum of the original signal $X(f)$. Figure
10.2 shows that the base-band spectrum of the zero-inserted signal is
composed of $I$ repetitions of the base-band spectrum of the original signal.
The interpolation of the zero-inserted signal is therefore equivalent to
filtering out the repetitions of $X(f)$ in the base band of $X_z(f)$, as illustrated in
Figure 10.2. Note that to maintain the real-time duration of the signal the
sampling rate of the interpolated signal $x_z(m)$ needs to be increased by a
factor of $I$.
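The two-stage process of zero-insertion and low-pass filtering can be sketched as below. The low-pass filter here is a Hamming-windowed sinc with its cutoff at $1/(2I)$ of the new sampling rate and a pass-band gain of $I$; this particular filter design, the parameter `half_len` and the function name `upsample` are assumptions made for illustration only.

```python
import numpy as np

def upsample(x, I, half_len=32):
    """Up-sample x by an integer factor I: zero-insertion, then low-pass filtering."""
    x = np.asarray(x, dtype=float)
    xz = np.zeros(len(x) * I)
    xz[::I] = x                                    # zero-inserted signal, Equation (10.4)
    L = half_len * I
    n = np.arange(-L, L + 1)
    h = np.sinc(n / I) * np.hamming(len(n))        # windowed-sinc low-pass filter
    y = np.convolve(xz, h)                         # full convolution
    return y[L:L + len(xz)]                        # remove the filter delay
```

Since the filter is zero at every non-zero multiple of $I$ and equal to one at its centre, the original samples pass through unchanged and only the inserted zeros are replaced by interpolated values.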

10.1.3 Interpolation of a Sequence of Lost Samples 

In this section, we introduce the problem of interpolation of a sequence of 
M missing samples of a signal given a number of samples on both sides of 
the gap, as illustrated in Figure 10.3. Perfect interpolation is only possible if 
the missing samples are redundant, in the sense that they carry no more 
information than that conveyed by the known neighbouring samples. This 
will be the case if the signal is a perfectly predictable signal such as a sine 
wave, or in the case of a band-limited random signal if the sampling rate is 
greater than M times the Nyquist rate. However, in many practical cases, 
the signal is a realisation of a random process, and the sampling rate is only 
marginally above the Nyquist rate. In such cases, the lost samples cannot be 
perfectly recovered, and some interpolation error is inevitable. 

A simple distortion model for a signal $y(m)$ with $M$ missing samples,
illustrated in Figure 10.3, is given by

$$y(m) = x(m)\,d(m) = x(m)[1 - r(m)] \tag{10.6}$$



where the distortion operator $d(m)$ is defined as

$$d(m) = 1 - r(m) \tag{10.7}$$

and $r(m)$ is a rectangular pulse of duration $M$ samples starting at the
sampling time $k$:

$$r(m) = \begin{cases} 1, & k \le m \le k+M-1 \\ 0, & \text{otherwise} \end{cases} \tag{10.8}$$

Figure 10.3 Illustration of a distortion model for a signal with a sequence of
missing samples.

In the frequency domain, Equation (10.6) becomes

$$Y(f) = X(f) * D(f) = X(f) * [\delta(f) - R(f)] = X(f) - X(f) * R(f) \tag{10.9}$$



where $D(f)$ is the spectrum of the distortion $d(m)$, $\delta(f)$ is the Kronecker delta
function, and $R(f)$, the frequency spectrum of the rectangular pulse $r(m)$, is
given by

$$R(f) = e^{-j2\pi f[k+(M-1)/2]}\,\frac{\sin(\pi f M)}{\sin(\pi f)} \tag{10.10}$$



In general, the distortion d(m ) is a non-invertible, many-to-one 
transformation, and perfect interpolation with zero error is not possible. 
However, as discussed in Section 10.3, the interpolation error can be 
minimised through optimal utilisation of the signal models and the 
information contained in the neighbouring samples. 

Example 10.1 Interpolation of missing samples of a sinusoidal signal.
Consider a cosine waveform of amplitude $A$ and frequency $F_0$ with $M$
missing samples, modelled as

$$y(m) = x(m)\,d(m) = A\cos(2\pi F_0 m)\,[1 - r(m)] \tag{10.11}$$

where $r(m)$ is the rectangular pulse defined in Equation (10.8). In the
frequency domain, the distorted signal can be expressed as

$$Y(f) = \frac{A}{2}[\delta(f-F_0) + \delta(f+F_0)] * [\delta(f) - R(f)]
= \frac{A}{2}[\delta(f-F_0) + \delta(f+F_0) - R(f-F_0) - R(f+F_0)] \tag{10.12}$$

where $R(f)$ is the spectrum of the pulse $r(m)$ as in Equation (10.10).

From Equation (10.12), it is evident that, for a cosine signal of
frequency $F_0$, the distortion in the frequency domain due to the missing
samples is manifested in the appearance of sinc functions centred at $\pm F_0$.
The distortion can be removed by filtering the signal with a very narrow
band-pass filter. Note that for a cosine signal, perfect restoration is possible
only because the signal has infinitely narrow bandwidth, or equivalently
because the signal is completely predictable. In fact, for this example, the
distortion can also be removed using a linear prediction model, which, for a
cosine signal, can be regarded as a data-adaptive narrow band-pass filter.
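The linear prediction route can be illustrated with the second-order recursion $x(m) = 2\cos(2\pi F_0)\,x(m-1) - x(m-2)$, which a noise-free sinusoid satisfies exactly. The sketch below assumes that the normalised frequency `F0` is known (in practice it would be estimated from the samples outside the gap), that at least two valid samples precede the gap, and that the observations are noise-free. The function name `fill_sinusoid_gap` is hypothetical.

```python
import numpy as np

def fill_sinusoid_gap(y, k, M, F0):
    """Restore the M missing samples y[k], ..., y[k+M-1] of a sinusoid (requires k >= 2)."""
    x = np.array(y, dtype=float)
    c = 2.0 * np.cos(2.0 * np.pi * F0)
    for m in range(k, k + M):
        x[m] = c * x[m - 1] - x[m - 2]             # second-order linear predictor
    return x
```

For noisy observations the prediction error accumulates along the gap, which is one reason why interpolators that use the samples on both sides of the gap are usually preferred.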

10.1.4 The Factors That Affect Interpolation Accuracy 

The interpolation accuracy is affected by a number of factors, the most 
important of which are as follows: 

(a) The predictability, or correlation structure of the signal: as the 
correlation of successive samples increases, the predictability of a 
sample from the neighbouring samples increases. In general, 
interpolation improves with the increasing correlation structure, or 
equivalently the decreasing bandwidth, of a signal. 

(b) The sampling rate: as the sampling rate increases, adjacent samples 
become more correlated, the redundant information increases, and 
interpolation improves. 

(c) Non-stationary characteristics of the signal: for time-varying signals 
the available samples some distance in time away from the missing 
samples may not be relevant because the signal characteristics may 
have completely changed. This is particularly important in 
interpolation of a large sequence of samples. 

(d) The length of the missing samples: in general, interpolation quality 
decreases with increasing length of the missing samples. 

(e) Finally, interpolation depends on the optimal use of the data and the 
efficiency of the interpolator. 

The classical approach to interpolation is to construct a polynomial 
interpolator function that passes through the known samples. We continue 
this chapter with a study of the general form of polynomial interpolation, 
and consider Lagrange, Newton, Hermite and cubic spline interpolators. 
Polynomial interpolators are not optimal or well suited to make efficient use 
of a relatively large number of known samples, or to interpolate a relatively 
large segment of missing samples. 
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In Section 10.3, we study several statistical digital signal processing 
methods for interpolation of a sequence of missing samples. These include 
model-based methods, which are well suited for interpolation of small to 
medium sized gaps of missing samples. We also consider frequency-time 
interpolation methods, and interpolation through waveform substitution, 
which have the ability to replace relatively large gaps of missing samples. 



10.2 Polynomial Interpolation 

The classical approach to interpolation is to construct a polynomial 
interpolator that passes through the known samples. Polynomial 
interpolators may be formulated in various forms, such as power series, 
Lagrange interpolation and Newton interpolation. These various forms are 
mathematically equivalent and can be transformed from one into another. 
Suppose the data consists of $N+1$ samples $\{x(t_0), x(t_1), \ldots, x(t_N)\}$, where
$x(t_n)$ denotes the amplitude of the signal $x(t)$ at time $t_n$. The polynomial of
order $N$ that passes through the $N+1$ known samples is unique (Figure 10.4)
and may be written in power series form as

$$\hat{x}(t) = p_N(t) = a_0 + a_1 t + a_2 t^2 + a_3 t^3 + \cdots + a_N t^N \tag{10.13}$$

where $p_N(t)$ is a polynomial of order $N$, and the $a_k$ are the polynomial
coefficients. From Equation (10.13), and a set of $N+1$ known samples, a
system of $N+1$ linear equations with $N+1$ unknown coefficients can be
formulated as

$$\begin{aligned}
x(t_0) &= a_0 + a_1 t_0 + a_2 t_0^2 + a_3 t_0^3 + \cdots + a_N t_0^N \\
x(t_1) &= a_0 + a_1 t_1 + a_2 t_1^2 + a_3 t_1^3 + \cdots + a_N t_1^N \\
&\;\;\vdots \\
x(t_N) &= a_0 + a_1 t_N + a_2 t_N^2 + a_3 t_N^3 + \cdots + a_N t_N^N
\end{aligned} \tag{10.14}$$

Figure 10.4 Illustration of an interpolation curve through a number of samples.

From Equation (10.14), the polynomial coefficients are given by

$$\begin{bmatrix} a_0 \\ a_1 \\ \vdots \\ a_N \end{bmatrix}
=
\begin{bmatrix}
1 & t_0 & t_0^2 & \cdots & t_0^N \\
1 & t_1 & t_1^2 & \cdots & t_1^N \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & t_N & t_N^2 & \cdots & t_N^N
\end{bmatrix}^{-1}
\begin{bmatrix} x(t_0) \\ x(t_1) \\ \vdots \\ x(t_N) \end{bmatrix} \tag{10.15}$$



The matrix in Equation (10.15) is called a Vandermonde matrix. For a large 
number of samples, N, the Vandermonde matrix becomes large and ill- 
conditioned. An ill-conditioned matrix is sensitive to small computational 
errors, such as quantisation errors, and can easily produce inaccurate results. 
There are alternative methods of implementation of the polynomial 
interpolator that are simpler to program and/or better structured, such as 
Lagrange and Newton methods. However, it must be noted that these 
variants of the polynomial interpolation also become ill-conditioned for a 
large number of samples, N. 
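The power-series solution and its conditioning problem can be illustrated with a short NumPy sketch. The function name `poly_coefficients` is hypothetical, and the linear system is solved directly rather than by forming the matrix inverse, which is a common numerical choice and not something prescribed by the text.

```python
import numpy as np

def poly_coefficients(t, x):
    """Solve the Vandermonde system for the power-series coefficients a_0, ..., a_N."""
    V = np.vander(t, increasing=True)              # rows are [1, t_i, t_i^2, ..., t_i^N]
    return np.linalg.solve(V, x)

# The condition number of the Vandermonde matrix grows quickly with the number of samples
cond_5 = np.linalg.cond(np.vander(np.linspace(0, 1, 5), increasing=True))
cond_15 = np.linalg.cond(np.vander(np.linspace(0, 1, 15), increasing=True))
```

Comparing `cond_5` with `cond_15` shows how the sensitivity to small errors increases as more samples are fitted with a single polynomial.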

10.2.1 Lagrange Polynomial Interpolation 

To introduce the Lagrange interpolation, consider a line interpolator passing
through two points $x(t_0)$ and $x(t_1)$:

$$\hat{x}(t) = p_1(t) = x(t_0) + \underbrace{\frac{x(t_1) - x(t_0)}{t_1 - t_0}}_{\text{line slope}}\,(t - t_0) \tag{10.16}$$

Figure 10.5 The Lagrange line interpolator passing through $x(t_0)$ and $x(t_1)$,
described in terms of the combination of two lines: one passing through
$(x(t_0), t_1)$ and the other through $(x(t_1), t_0)$.

The line Equation (10.16) may be rearranged and expressed as

$$p_1(t) = \frac{t - t_1}{t_0 - t_1}\,x(t_0) + \frac{t - t_0}{t_1 - t_0}\,x(t_1) \tag{10.17}$$

Equation (10.17) is in the form of a Lagrange polynomial. Note that the
Lagrange form of a line interpolator is composed of the weighted
combination of two lines, as illustrated in Figure 10.5.

In general, the Lagrange polynomial, of order $N$, passing through $N+1$
samples $\{x(t_0), x(t_1), \ldots, x(t_N)\}$ is given by the polynomial equation

$$p_N(t) = L_0(t)\,x(t_0) + L_1(t)\,x(t_1) + \cdots + L_N(t)\,x(t_N) \tag{10.18}$$

where each Lagrange coefficient $L_i(t)$ is itself a polynomial of degree $N$
given by

$$L_i(t) = \frac{(t-t_0)\cdots(t-t_{i-1})(t-t_{i+1})\cdots(t-t_N)}{(t_i-t_0)\cdots(t_i-t_{i-1})(t_i-t_{i+1})\cdots(t_i-t_N)} = \prod_{\substack{n=0 \\ n \ne i}}^{N} \frac{t - t_n}{t_i - t_n} \tag{10.19}$$

Note that the $i$th Lagrange polynomial coefficient $L_i(t)$ becomes unity at the
$i$th known sample point (i.e. $L_i(t_i)=1$), and zero at every other known sample
(i.e. $L_i(t_j)=0$, $i \ne j$). Therefore $p_N(t_i)=L_i(t_i)\,x(t_i)=x(t_i)$, and the polynomial
passes through the known data points as required.
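Equations (10.18) and (10.19) translate directly into a short routine. The following is a minimal sketch with a hypothetical function name, evaluating the Lagrange polynomial at a single time instant `t`; it assumes the sample times in `t_known` are distinct.

```python
def lagrange_interpolate(t_known, x_known, t):
    """Evaluate the Lagrange polynomial of Equation (10.18) at time t."""
    total = 0.0
    for i, (ti, xi) in enumerate(zip(t_known, x_known)):
        Li = 1.0
        for n, tn in enumerate(t_known):
            if n != i:
                Li *= (t - tn) / (ti - tn)         # one factor of L_i(t), Equation (10.19)
        total += Li * xi
    return total
```

The two nested loops make the cost of each evaluation proportional to $(N+1)^2$, which reflects the computational drawback of the Lagrange form.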

The main drawbacks of the Lagrange interpolation method are as 
follows: 

(a) The computational complexity is large. 

(b) The coefficients of a polynomial of order N cannot be used in the 
calculations of the coefficients of a higher order polynomial. 

(c) The evaluation of the interpolation error is difficult. 

The Newton polynomial, introduced in the next section, overcomes some of 
these difficulties. 



10.2.2 Newton Polynomial Interpolation 

Newton polynomials have a recursive structure, such that a polynomial of
order $N$ can be constructed by extension of a polynomial of order $N-1$ as
follows:

$$\begin{aligned}
p_0(t) &= a_0 && \text{(d.c. value)} \\
p_1(t) &= a_0 + a_1(t-t_0) = p_0(t) + a_1(t-t_0) && \text{(ramp)} \\
p_2(t) &= a_0 + a_1(t-t_0) + a_2(t-t_0)(t-t_1) = p_1(t) + a_2(t-t_0)(t-t_1) && \text{(quadratic)} \\
p_3(t) &= a_0 + a_1(t-t_0) + a_2(t-t_0)(t-t_1) + a_3(t-t_0)(t-t_1)(t-t_2) \\
&= p_2(t) + a_3(t-t_0)(t-t_1)(t-t_2) && \text{(cubic)}
\end{aligned} \tag{10.20}$$

and in general the recursive, order update, form of a Newton polynomial
can be formulated as

$$p_N(t) = p_{N-1}(t) + a_N (t-t_0)(t-t_1)\cdots(t-t_{N-1}) \tag{10.21}$$

For a sequence of $N+1$ samples $\{x(t_0), x(t_1), \ldots, x(t_N)\}$, the polynomial
coefficients are obtained using the constraint $p_N(t_i)=x(t_i)$ as follows. To
solve for the coefficient $a_0$, equate the polynomial Equation (10.21) at $t=t_0$
to $x(t_0)$:

$$p_N(t_0) = p_0(t_0) = x(t_0) = a_0 \tag{10.22}$$

To solve for the coefficient $a_1$, the first-order polynomial $p_1(t)$ is evaluated
at $t=t_1$:

$$p_1(t_1) = x(t_1) = a_0 + a_1(t_1 - t_0) = x(t_0) + a_1(t_1 - t_0) \tag{10.23}$$

from which

$$a_1 = \frac{x(t_1) - x(t_0)}{t_1 - t_0} \tag{10.24}$$



Note that the coefficient $a_1$ is the slope of the line passing through the
points $[x(t_0), x(t_1)]$. To solve for the coefficient $a_2$ the second-order
polynomial $p_2(t)$ is evaluated at $t=t_2$:

$$p_2(t_2) = x(t_2) = a_0 + a_1(t_2 - t_0) + a_2(t_2 - t_0)(t_2 - t_1) \tag{10.25}$$

Substituting $a_0$ and $a_1$ from Equations (10.22) and (10.24) in Equation
(10.25) we obtain

$$a_2 = \frac{1}{t_2 - t_0}\left[\frac{x(t_2) - x(t_1)}{t_2 - t_1} - \frac{x(t_1) - x(t_0)}{t_1 - t_0}\right] \tag{10.26}$$



Each term in the square brackets of Equation (10.26) is a slope term, and
the coefficient $a_2$ is the slope of the slope. To formulate a solution for the
higher-order coefficients, we need to introduce the concept of divided
differences. Each of the two ratios in the square brackets of Equation
(10.26) is a so-called "divided difference". The divided difference between
two points $t_i$ and $t_{i-1}$ is defined as

$$d_1(t_{i-1}, t_i) = \frac{x(t_i) - x(t_{i-1})}{t_i - t_{i-1}} \tag{10.27}$$

The divided difference between two points may be interpreted as the
average difference or the slope of the line passing through the two points.
The second-order divided difference (i.e. the divided difference of the
divided difference) over three points $t_{i-2}$, $t_{i-1}$ and $t_i$ is given by

$$d_2(t_{i-2}, t_i) = \frac{d_1(t_{i-1}, t_i) - d_1(t_{i-2}, t_{i-1})}{t_i - t_{i-2}} \tag{10.28}$$

and the third-order divided difference is

$$d_3(t_{i-3}, t_i) = \frac{d_2(t_{i-2}, t_i) - d_2(t_{i-3}, t_{i-1})}{t_i - t_{i-3}} \tag{10.29}$$

and so on. In general the $j$th order divided difference can be formulated in
terms of the divided differences of order $j-1$, in an order-update equation
given as

$$d_j(t_{i-j}, t_i) = \frac{d_{j-1}(t_{i-j+1}, t_i) - d_{j-1}(t_{i-j}, t_{i-1})}{t_i - t_{i-j}} \tag{10.30}$$

Note that $a_1 = d_1(t_0, t_1)$, $a_2 = d_2(t_0, t_2)$ and $a_3 = d_3(t_0, t_3)$, and in
general the Newton polynomial coefficients are obtained from the divided
differences using the relation

$$a_i = d_i(t_0, t_i) \tag{10.31}$$

A main advantage of the Newton polynomial is its computational 
efficiency, in that a polynomial of order N - 1 can be easily extended to a 
higher-order polynomial of order N. This is a useful property in the 
selection of the best polynomial order for a given set of data. 
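The order-update Equation (10.30) and the nested form of Equation (10.21) can be sketched as follows. The function names are hypothetical, the divided differences are computed in place in a single list, and the polynomial is evaluated with a Horner-style nesting, which is a standard implementation choice rather than a detail given in the text.

```python
def newton_coefficients(t, x):
    """Newton coefficients a_i = d_i(t_0, t_i), computed by divided differences."""
    a = [float(v) for v in x]
    n = len(t)
    for j in range(1, n):                          # order of the divided difference
        for i in range(n - 1, j - 1, -1):          # update in place, highest index first
            a[i] = (a[i] - a[i - 1]) / (t[i] - t[i - j])
    return a

def newton_evaluate(a, t_known, t):
    """Evaluate the Newton polynomial of Equation (10.21) at time t by nesting."""
    p = a[-1]
    for i in range(len(a) - 2, -1, -1):
        p = p * (t - t_known[i]) + a[i]
    return p
```

Appending one more sample leaves the previously computed coefficients unchanged and adds a single new coefficient, which is the order-update property described above.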

10.2.3 Hermite Polynomial Interpolation 

Hermite polynomials are formulated to fit not only the signal samples,
but also the derivatives of the signal. Suppose the data consists of
N + 1 samples and assume that all the derivatives up to the M-th order
derivative are available. Let the data set, i.e. the signal samples and the
derivatives, be denoted as $[x(t_i), x'(t_i), x''(t_i), \ldots, x^{(M)}(t_i),\ i = 0, \ldots, N]$. There
are altogether K = (N+1)(M+1) data points, and a polynomial of order K-1
can be fitted to the data as

$$ p(t) = a_0 + a_1 t + a_2 t^2 + \cdots + a_{K-1} t^{K-1} \tag{10.32} $$

To obtain the polynomial coefficients, we substitute the given samples in 
the polynomial and its M derivatives as 



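$$
\begin{aligned}
p(t_i) &= x(t_i) \\
p'(t_i) &= x'(t_i) \\
p''(t_i) &= x''(t_i) \\
&\;\;\vdots \\
p^{(M)}(t_i) &= x^{(M)}(t_i), \qquad i = 0, 1, \ldots, N
\end{aligned}
\tag{10.33}
$$

In all, there are K = (M+1)(N+1) equations in (10.33), and these can be used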
to calculate the coefficients of the polynomial Equation (10.32). In theory, 
the constraint that the polynomial must also fit the derivatives should result 
in a better interpolating polynomial that passes through the sampled points 
and is also consistent with the known underlying dynamics (i.e. the 
derivatives) of the curve. However, even for moderate values of N and M, 
the size of Equation (10.33) becomes too large for most practical purposes. 
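
For the simplest case M = 1 (samples plus first derivatives), the system of Equation (10.33) can be assembled and solved directly. The sketch below assumes NumPy; the function name `hermite_fit` is hypothetical. It solves the K = 2(N+1) equations as a dense linear system, which is only sensible for small K.

```python
import numpy as np

def hermite_fit(t, x, dx):
    """Coefficients a_0..a_{K-1} of (10.32) from samples x(t_i) and slopes x'(t_i)."""
    t = np.asarray(t, dtype=float)
    K = 2 * len(t)                      # K = (N+1)(M+1) with M = 1
    powers = np.arange(K)
    V = t[:, None] ** powers            # rows enforcing p(t_i) = x(t_i)
    dV = np.zeros_like(V)               # rows enforcing p'(t_i) = x'(t_i)
    dV[:, 1:] = powers[1:] * t[:, None] ** (powers[1:] - 1)
    A = np.vstack([V, dV])
    b = np.concatenate([x, dx])
    return np.linalg.solve(A, b)
```

The matrix `A` grows as K by K and becomes poorly conditioned quickly, which illustrates why the approach is impractical for moderate N and M.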

10.2.4 Cubic Spline Interpolation 

A polynomial interpolator of order N is constrained to pass through N + 1 
known samples, and can have N - 1 maxima and minima. In general, the 
interpolation error increases rapidly with the increasing polynomial order, 
as the interpolating curve has to wiggle through the N + 1 samples. When a 
large number of samples are to be fitted with a smooth curve, it may be 
better to divide the signal into a number of smaller intervals, and to fit a low 
order interpolating polynomial to each small interval. Care must be taken to 
ensure that the polynomial curves are continuous at the endpoints of each 
interval. In cubic spline interpolation, a cubic polynomial is fitted to each 
interval between two samples. A cubic polynomial has the form 

$$ p(t) = a_0 + a_1 t + a_2 t^2 + a_3 t^3 \tag{10.34} $$

A cubic polynomial has four coefficients, and needs four conditions for the
determination of a unique set of coefficients. For each interval, two 
conditions are set by the samples at the endpoints of the interval. Two 
further conditions are met by the constraints that the first derivatives of the 
polynomial should be continuous across each of the two endpoints. 

Consider an interval $t_i \le t \le t_{i+1}$ of length $T_i = t_{i+1} - t_i$, as shown in Figure 10.6.
Using a local coordinate $\tau = t - t_i$, the cubic polynomial becomes

$$ p(\tau) = a_0 + a_1 \tau + a_2 \tau^2 + a_3 \tau^3 \tag{10.35} $$

At $\tau = 0$, we obtain the first coefficient $a_0$ as

$$ a_0 = p(\tau = 0) = x(t_i) \tag{10.36} $$

The second derivative of $p(\tau)$ is given by

$$ p''(\tau) = 2a_2 + 6a_3 \tau \tag{10.37} $$

Evaluation of the second derivative at $\tau = 0$ (i.e. $t = t_i$) gives the coefficient $a_2$,
where $p''_i$ denotes the second derivative at $t_i$:

$$ a_2 = \frac{p''(\tau = 0)}{2} = \frac{p''_i}{2} \tag{10.38} $$

Similarly, evaluating the second derivative at the point $t_{i+1}$ (i.e. $\tau = T_i$) yields
the fourth coefficient

$$ a_3 = \frac{p''_{i+1} - p''_i}{6 T_i} \tag{10.39} $$



Now to obtain the coefficient $a_1$, we evaluate $p(\tau)$ at $\tau = T_i$:

$$ p(\tau = T_i) = a_0 + a_1 T_i + a_2 T_i^2 + a_3 T_i^3 = x(t_{i+1}) \tag{10.40} $$

and substitute $a_0$, $a_2$ and $a_3$ from Equations (10.36), (10.38) and (10.39) in
(10.40) to obtain

$$ a_1 = \frac{x(t_{i+1}) - x(t_i)}{T_i} - \frac{p''_{i+1} + 2p''_i}{6}\, T_i \tag{10.41} $$



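The cubic polynomial can now be written as

$$ p(\tau) = x(t_i) + \left[ \frac{x(t_{i+1}) - x(t_i)}{T_i} - \frac{p''_{i+1} + 2p''_i}{6}\, T_i \right] \tau + \frac{p''_i}{2}\, \tau^2 + \frac{p''_{i+1} - p''_i}{6 T_i}\, \tau^3 \tag{10.42} $$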



To determine the coefficients of the polynomial in Equation (10.42), we
need the second derivatives $p''_i$ and $p''_{i+1}$. These are obtained from the
constraint that the first derivatives of the curves at the endpoints of each
interval must be continuous. From Equation (10.42), the first derivatives of
$p(\tau)$ evaluated at the endpoints $t_i$ and $t_{i+1}$ are

$$ p'_i = p'(\tau = 0) = -\frac{T_i}{6} \left[ p''_{i+1} + 2p''_i \right] + \frac{1}{T_i} \left[ x(t_{i+1}) - x(t_i) \right] \tag{10.43} $$

$$ p'_{i+1} = p'(\tau = T_i) = \frac{T_i}{6} \left[ 2p''_{i+1} + p''_i \right] + \frac{1}{T_i} \left[ x(t_{i+1}) - x(t_i) \right] \tag{10.44} $$

Similarly, for the preceding interval, $t_{i-1} \le t \le t_i$, the first derivative of the
cubic spline curve evaluated at $t = t_i$ is given by

$$ p'_i = p'(\tau = T_{i-1}) = \frac{T_{i-1}}{6} \left[ 2p''_i + p''_{i-1} \right] + \frac{1}{T_{i-1}} \left[ x(t_i) - x(t_{i-1}) \right] \tag{10.45} $$



For continuity of the first derivative at $t_i$, $p'$ at the end of the interval $(t_{i-1}, t_i)$
must be equal to $p'$ at the start of the interval $(t_i, t_{i+1})$. Equating the
right-hand sides of Equations (10.43) and (10.45) and repeating this
exercise yields

$$ T_{i-1}\, p''_{i-1} + 2(T_{i-1} + T_i)\, p''_i + T_i\, p''_{i+1} = 6 \left[ \frac{1}{T_{i-1}}\, x(t_{i-1}) - \left( \frac{1}{T_{i-1}} + \frac{1}{T_i} \right) x(t_i) + \frac{1}{T_i}\, x(t_{i+1}) \right], \qquad i = 1, 2, \ldots, N-1 \tag{10.46} $$

In Equation (10.46), there are N-1 equations in N+1 unknowns $p''_i$. For a
unique solution we need to specify the second derivatives at the points $t_0$
and $t_N$. This can be done in two ways: (a) setting the second derivatives at
the endpoints $t_0$ and $t_N$ (i.e. $p''_0$ and $p''_N$) to zero, or (b) extrapolating the
derivatives from the inside data. 
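
The following sketch puts Equations (10.42) and (10.46) together for boundary option (a), zero second derivatives at both ends. It assumes NumPy; the function names are hypothetical, and the dense solve is used only for brevity (the system is tridiagonal, so a banded solver would be the natural choice for long records).

```python
import numpy as np

def natural_spline_second_derivs(t, x):
    """Solve Equation (10.46) for p''_0..p''_N with p''_0 = p''_N = 0."""
    t = np.asarray(t, dtype=float)
    x = np.asarray(x, dtype=float)
    N = len(t) - 1
    T = np.diff(t)                          # T[i] = t_{i+1} - t_i
    A = np.zeros((N + 1, N + 1))
    b = np.zeros(N + 1)
    A[0, 0] = A[N, N] = 1.0                 # boundary option (a)
    for i in range(1, N):
        A[i, i - 1] = T[i - 1]
        A[i, i] = 2.0 * (T[i - 1] + T[i])
        A[i, i + 1] = T[i]
        b[i] = 6.0 * ((x[i + 1] - x[i]) / T[i] - (x[i] - x[i - 1]) / T[i - 1])
    return np.linalg.solve(A, b)

def spline_eval(t, x, d2, tq):
    """Evaluate the piecewise cubic of Equation (10.42) at a single time tq."""
    i = int(np.clip(np.searchsorted(t, tq) - 1, 0, len(t) - 2))
    Ti = t[i + 1] - t[i]
    tau = tq - t[i]                         # local coordinate
    a1 = (x[i + 1] - x[i]) / Ti - (d2[i + 1] + 2.0 * d2[i]) * Ti / 6.0
    a2 = d2[i] / 2.0
    a3 = (d2[i + 1] - d2[i]) / (6.0 * Ti)
    return x[i] + a1 * tau + a2 * tau ** 2 + a3 * tau ** 3
```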



10.3 Model-Based Interpolation 

The statistical signal processing approach to interpolation of a sequence of 
lost samples is based on the utilisation of a predictive and/or a probabilistic 
model of the signal. In this section, we study the maximum a posteriori 
interpolation, an autoregressive model-based interpolation, a frequency- 
time interpolation method, and interpolation through searching a signal 
record for the best replacement. 

Figures 10.7 and 10.8 illustrate the problem of interpolation of a sequence 
of lost samples. It is assumed that we have a signal record of N samples, 
and that within this record a segment of M samples, starting at time k, 
$\mathbf{x}_{Uk} = \{x(k), x(k+1), \ldots, x(k+M-1)\}$, is missing. The objective is to make an
optimal estimate of the missing segment $\mathbf{x}_{Uk}$, using the remaining N-M
samples $\mathbf{x}_{Kn}$ and a model of the signal process. An N-sample signal vector
$\mathbf{x}$, composed of M unknown samples and N-M known samples, can be
written as

$$
\mathbf{x} =
\begin{pmatrix} \mathbf{x}_{Kn1} \\ \mathbf{x}_{Uk} \\ \mathbf{x}_{Kn2} \end{pmatrix}
=
\begin{pmatrix} \mathbf{x}_{Kn1} \\ \mathbf{0} \\ \mathbf{x}_{Kn2} \end{pmatrix}
+
\begin{pmatrix} \mathbf{0} \\ \mathbf{x}_{Uk} \\ \mathbf{0} \end{pmatrix}
= K \mathbf{x}_{Kn} + U \mathbf{x}_{Uk}
\tag{10.47}
$$

Figure 10.7 Illustration of a model-based iterative signal interpolation system.

Figure 10.8 A signal with M missing samples and N-M known samples. On each
side of the missing segment, P samples are used to interpolate the segment.



where the vector $\mathbf{x}_{Kn} = [\mathbf{x}_{Kn1}\ \mathbf{x}_{Kn2}]^T$ is composed of the known samples, and
the vector $\mathbf{x}_{Uk}$ is composed of the unknown samples, as illustrated in Figure
10.8. The matrices K and U in Equation (10.47) are rearrangement matrices
that assemble the vector $\mathbf{x}$ from $\mathbf{x}_{Kn}$ and $\mathbf{x}_{Uk}$.

10.3.1 Maximum A Posteriori Interpolation

The posterior pdf of an unknown signal segment $\mathbf{x}_{Uk}$ given a number of
neighbouring samples $\mathbf{x}_{Kn}$ can be expressed using Bayes’ rule as

$$ f_X(\mathbf{x}_{Uk} \mid \mathbf{x}_{Kn}) = \frac{f_X(\mathbf{x}_{Kn}, \mathbf{x}_{Uk})}{f_X(\mathbf{x}_{Kn})} = \frac{f_X(\mathbf{x} = K\mathbf{x}_{Kn} + U\mathbf{x}_{Uk})}{f_X(\mathbf{x}_{Kn})} \tag{10.48} $$

In Equation (10.48), for a given sequence of samples $\mathbf{x}_{Kn}$, $f_X(\mathbf{x}_{Kn})$ is a
constant. Therefore the estimate that maximises the posterior pdf, i.e. the
MAP estimate, is given by

$$ \hat{\mathbf{x}}_{Uk}^{MAP} = \arg\max_{\mathbf{x}_{Uk}} f_X(K\mathbf{x}_{Kn} + U\mathbf{x}_{Uk}) \tag{10.49} $$



Example 10.2 MAP interpolation of a Gaussian signal. Assume that an
observation signal $\mathbf{x} = K\mathbf{x}_{Kn} + U\mathbf{x}_{Uk}$, from a zero-mean Gaussian process, is
composed of a sequence of M missing samples $\mathbf{x}_{Uk}$ and N-M known
neighbouring samples as in Equation (10.47). The pdf of the signal $\mathbf{x}$ is
given by

$$ f_X(\mathbf{x}) = \frac{1}{(2\pi)^{N/2} |\Sigma_{xx}|^{1/2}} \exp\left( -\frac{1}{2} \mathbf{x}^T \Sigma_{xx}^{-1} \mathbf{x} \right) \tag{10.50} $$

where $\Sigma_{xx}$ is the covariance matrix of the Gaussian vector process $\mathbf{x}$.
Substitution of Equation (10.50) in Equation (10.48) yields the conditional
pdf of the unknown signal $\mathbf{x}_{Uk}$ given a number of samples $\mathbf{x}_{Kn}$:

$$ f_X(\mathbf{x}_{Uk} \mid \mathbf{x}_{Kn}) = \frac{1}{f_X(\mathbf{x}_{Kn})} \frac{1}{(2\pi)^{N/2} |\Sigma_{xx}|^{1/2}} \exp\left( -\frac{1}{2} (K\mathbf{x}_{Kn} + U\mathbf{x}_{Uk})^T \Sigma_{xx}^{-1} (K\mathbf{x}_{Kn} + U\mathbf{x}_{Uk}) \right) \tag{10.51} $$

The MAP signal estimate, obtained by setting the derivative of the log-likelihood
function $\ln f_X(\mathbf{x}_{Uk} \mid \mathbf{x}_{Kn})$ of Equation (10.51) with respect to $\mathbf{x}_{Uk}$ to
zero, is given by

$$ \hat{\mathbf{x}}_{Uk}^{MAP} = -\left( U^T \Sigma_{xx}^{-1} U \right)^{-1} U^T \Sigma_{xx}^{-1} K \mathbf{x}_{Kn} \tag{10.52} $$

An example of MAP interpolation is shown in Figure 10.9.

Figure 10.9 Illustration of MAP interpolation of a segment of 20 samples.
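
As a hedged illustration of Equations (10.47) and (10.52), the sketch below builds the rearrangement matrices K and U as column subsets of the identity matrix and evaluates the Gaussian MAP estimate. NumPy is assumed; the function name `map_interpolate` and its argument layout are hypothetical, and the covariance matrix is taken as given rather than estimated.

```python
import numpy as np

def map_interpolate(x_kn, cov, k, M):
    """MAP estimate of the M samples missing from index k (Equation 10.52).

    x_kn holds the N-M known samples in time order; cov is the N x N covariance.
    """
    N = cov.shape[0]
    unknown = np.arange(k, k + M)
    known = np.setdiff1d(np.arange(N), unknown)
    eye = np.eye(N)
    K = eye[:, known]          # N x (N-M): scatters known samples into x
    U = eye[:, unknown]        # N x M: scatters unknown samples into x
    prec = np.linalg.inv(cov)  # inverse covariance matrix
    return -np.linalg.solve(U.T @ prec @ U, U.T @ prec @ K @ x_kn)
```

For a Gaussian process this estimate coincides with the conditional mean of the missing samples given the known ones, which provides a convenient numerical check.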



10.3.2 Least Square Error Autoregressive Interpolation 

In this section, we describe interpolation based on an autoregressive (AR) 
model of the signal process. The term “autoregressive model” is an 
alternative terminology for the linear predictive models considered in 
Chapter 7. In this section, the terms “linear predictive model” and 
“autoregressive model” are used interchangeably. The AR interpolation 
algorithm is a two-stage process: in the first stage, the AR model 
coefficients are estimated from the incomplete signal, and in the second 
stage the estimates of the model coefficients are used to interpolate the 
missing samples. For high-quality interpolation, the estimation algorithm 
should utilise all the correlation structures of the signal process, including 
periodic or pitch period structures. In Section 10.3.4, the AR interpolation 
method is extended to include pitch-period correlations. 






10.3.3 Interpolation Based on a Short-Term Prediction Model 

An autoregressive (AR), or linear predictive, signal x(m) is described as 

$$ x(m) = \sum_{k=1}^{P} a_k x(m-k) + e(m) \tag{10.53} $$

where x(m) is the AR signal, $a_k$ are the model coefficients and e(m) is a
zero-mean excitation signal. The excitation may be a random signal, a
quasi-periodic impulse train, or a mixture of the two. The AR coefficients, $a_k$,
model the correlation structure or equivalently the spectral patterns of the
signal.

Assume that we have a signal record of N samples and that within this
record a segment of M samples, starting from the sample k,
$\mathbf{x}_{Uk} = \{x(k), \ldots, x(k+M-1)\}$, is missing. The objective is to estimate the missing samples
$\mathbf{x}_{Uk}$, using the remaining N-M samples and an AR model of the signal.

Figure 10.8 illustrates the interpolation problem. For this signal record of N
samples, the AR equation (10.53) can be expanded to form the following
matrix equation:

$$
\left(\begin{array}{c}
e(P)\\ e(P+1)\\ \vdots\\ e(k-1)\\ \hdashline
e(k)\\ e(k+1)\\ e(k+2)\\ \vdots\\ e(k+M+P-2)\\ e(k+M+P-1)\\ \hdashline
e(k+M+P)\\ e(k+M+P+1)\\ \vdots\\ e(N-1)
\end{array}\right)
=
\left(\begin{array}{c}
x(P)\\ x(P+1)\\ \vdots\\ x(k-1)\\ \hdashline
x_{Uk}(k)\\ x_{Uk}(k+1)\\ x_{Uk}(k+2)\\ \vdots\\ x(k+M+P-2)\\ x(k+M+P-1)\\ \hdashline
x(k+M+P)\\ x(k+M+P+1)\\ \vdots\\ x(N-1)
\end{array}\right)
-
\left(\begin{array}{cccc}
x(P-1) & x(P-2) & \cdots & x(0)\\
x(P) & x(P-1) & \cdots & x(1)\\
\vdots & \vdots & \ddots & \vdots\\
x(k-2) & x(k-3) & \cdots & x(k-P-1)\\ \hdashline
x(k-1) & x(k-2) & \cdots & x(k-P)\\
x_{Uk}(k) & x(k-1) & \cdots & x(k-P+1)\\
x_{Uk}(k+1) & x_{Uk}(k) & \cdots & x(k-P+2)\\
\vdots & \vdots & \ddots & \vdots\\
x(k+M+P-3) & x(k+M+P-4) & \cdots & x_{Uk}(k+M-2)\\
x(k+M+P-2) & x(k+M+P-3) & \cdots & x_{Uk}(k+M-1)\\ \hdashline
x(k+M+P-1) & x(k+M+P-2) & \cdots & x(k+M)\\
x(k+M+P) & x(k+M+P-1) & \cdots & x(k+M+1)\\
\vdots & \vdots & \ddots & \vdots\\
x(N-2) & x(N-3) & \cdots & x(N-P-1)
\end{array}\right)
\left(\begin{array}{c} a_1\\ a_2\\ \vdots\\ a_P \end{array}\right)
\tag{10.54}
$$

where the subscript Uk denotes the unknown samples. Equation (10.54) can
be rewritten in compact vector notation as

$$ \mathbf{e}(\mathbf{x}_{Uk}, \mathbf{a}) = \mathbf{x} - X\mathbf{a} \tag{10.55} $$

where the error vector $\mathbf{e}(\mathbf{x}_{Uk}, \mathbf{a})$ is expressed as a function of the unknown
samples and the unknown model coefficient vector. In this section, the
optimality criterion for the estimation of the model coefficient vector $\mathbf{a}$
and the missing samples $\mathbf{x}_{Uk}$ is the minimum mean square error given by
the inner vector product

$$ \mathbf{e}^T\mathbf{e}(\mathbf{x}_{Uk}, \mathbf{a}) = \mathbf{x}^T\mathbf{x} + \mathbf{a}^T X^T X \mathbf{a} - 2\mathbf{a}^T X^T \mathbf{x} \tag{10.56} $$

The squared error function in Equation (10.56) involves nonlinear unknown
terms of fourth order, $\mathbf{a}^T X^T X \mathbf{a}$, and cubic order, $\mathbf{a}^T X^T \mathbf{x}$. The least square
error formulation, obtained by differentiating $\mathbf{e}^T\mathbf{e}(\mathbf{x}_{Uk}, \mathbf{a})$ with respect to the
vectors $\mathbf{a}$ or $\mathbf{x}_{Uk}$, results in a set of nonlinear equations of cubic order whose

solution is non-trivial. A suboptimal, but practical and mathematically 
tractable, approach is to solve for the missing samples and the unknown 
model coefficients in two separate stages. This is an instance of the general 
estimate-and-maximise (EM) algorithm, and is similar to the linear- 
predictive model-based restoration considered in Section 6.7. In the first 
stage of the solution, Equation (10.54) is linearised by either assuming that 
the missing samples have zero values or discarding the set of equations in 
(10.54), between the two dashed lines, that involve the unknown signal 
samples. The linearised equations are used to solve for the AR model 
coefficient vector a by forming the equation 

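$$ \hat{\mathbf{a}} = \left( X_{Kn}^T X_{Kn} \right)^{-1} \left( X_{Kn}^T \mathbf{x}_{Kn} \right) \tag{10.57} $$

where the vector $\hat{\mathbf{a}}$ is an estimate of the model coefficients, obtained from the
available signal samples.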

The second stage of the solution involves the estimation of the 
unknown signal samples $\mathbf{x}_{Uk}$. For an AR model of order P, and an unknown
signal segment of length M, there are M+P nonlinear equations in (10.54)
that involve the unknown samples; these are

$$
\begin{pmatrix}
e(k)\\ e(k+1)\\ e(k+2)\\ \vdots\\ e(k+M+P-2)\\ e(k+M+P-1)
\end{pmatrix}
=
\begin{pmatrix}
x_{Uk}(k)\\ x_{Uk}(k+1)\\ x_{Uk}(k+2)\\ \vdots\\ x(k+M+P-2)\\ x(k+M+P-1)
\end{pmatrix}
-
\begin{pmatrix}
x(k-1) & x(k-2) & \cdots & x(k-P)\\
x_{Uk}(k) & x(k-1) & \cdots & x(k-P+1)\\
x_{Uk}(k+1) & x_{Uk}(k) & \cdots & x(k-P+2)\\
\vdots & \vdots & \ddots & \vdots\\
x(k+M+P-3) & x(k+M+P-4) & \cdots & x_{Uk}(k+M-2)\\
x(k+M+P-2) & x(k+M+P-3) & \cdots & x_{Uk}(k+M-1)
\end{pmatrix}
\begin{pmatrix} a_1\\ a_2\\ a_3\\ \vdots\\ a_{P-1}\\ a_P \end{pmatrix}
\tag{10.58}
$$



The estimate of the predictor coefficient vector $\hat{\mathbf{a}}$, obtained from the first
stage of the solution, is substituted in Equation (10.58) so that the only
remaining unknowns in (10.58) are the missing signal samples. Equation
(10.58) may be partitioned and rearranged in vector notation in the
following form, where the ellipses indicate the banded structure of the two
coefficient matrices:

$$
\begin{pmatrix}
e(k)\\ e(k+1)\\ e(k+2)\\ \vdots\\ e(k+P)\\ e(k+P+1)\\ \vdots\\ e(k+M+P-1)
\end{pmatrix}
=
\begin{pmatrix}
1 & 0 & 0 & \cdots & 0\\
-\hat{a}_1 & 1 & 0 & \cdots & 0\\
-\hat{a}_2 & -\hat{a}_1 & 1 & \cdots & 0\\
\vdots & \vdots & \vdots & \ddots & \vdots\\
-\hat{a}_P & -\hat{a}_{P-1} & -\hat{a}_{P-2} & \cdots & \\
0 & -\hat{a}_P & -\hat{a}_{P-1} & \cdots & \\
\vdots & \vdots & \vdots & \ddots & \vdots\\
0 & 0 & 0 & \cdots & -\hat{a}_P
\end{pmatrix}
\begin{pmatrix}
x_{Uk}(k)\\ x_{Uk}(k+1)\\ x_{Uk}(k+2)\\ \vdots\\ x_{Uk}(k+M-1)
\end{pmatrix}
+
\begin{pmatrix}
-\hat{a}_P & -\hat{a}_{P-1} & \cdots & -\hat{a}_1 & 0 & 0 & \cdots & 0\\
0 & -\hat{a}_P & \cdots & -\hat{a}_2 & 0 & 0 & \cdots & 0\\
0 & 0 & \cdots & -\hat{a}_3 & 0 & 0 & \cdots & 0\\
\vdots & \vdots & & \vdots & \vdots & \vdots & & \vdots\\
0 & 0 & \cdots & 0 & & & & \\
0 & 0 & \cdots & 0 & & & & \\
\vdots & \vdots & & \vdots & \vdots & \vdots & & \vdots\\
0 & 0 & \cdots & 0 & -\hat{a}_{P-1} & -\hat{a}_{P-2} & \cdots & 1
\end{pmatrix}
\begin{pmatrix}
x(k-P)\\ x(k-P+1)\\ \vdots\\ x(k-1)\\ x(k+M)\\ x(k+M+1)\\ \vdots\\ x(k+M+P-1)
\end{pmatrix}
\tag{10.59}
$$

In Equation (10.59), the unknown and known samples are rearranged and
grouped into two separate vectors. In a compact vector-matrix notation,
Equation (10.58) can be written in the form

$$ \mathbf{e} = A_1 \mathbf{x}_{Uk} + A_2 \mathbf{x}_{Kn} \tag{10.60} $$

where $\mathbf{e}$ is the error vector, $A_1$ is the first coefficient matrix, $\mathbf{x}_{Uk}$ is the
unknown signal vector being estimated, $A_2$ is the second coefficient matrix
and the vector $\mathbf{x}_{Kn}$ consists of the known samples in the signal matrix and
vectors of Equation (10.58). The total squared error is given by

$$ \mathbf{e}^T\mathbf{e} = (A_1 \mathbf{x}_{Uk} + A_2 \mathbf{x}_{Kn})^T (A_1 \mathbf{x}_{Uk} + A_2 \mathbf{x}_{Kn}) \tag{10.61} $$

The least square AR (LSAR) interpolation is obtained by minimisation of
the squared error function with respect to the unknown signal samples $\mathbf{x}_{Uk}$:

$$ \frac{\partial\, \mathbf{e}^T\mathbf{e}}{\partial\, \mathbf{x}_{Uk}} = 2A_1^T A_1 \mathbf{x}_{Uk} + 2A_1^T A_2 \mathbf{x}_{Kn} = 0 \tag{10.62} $$

From Equation (10.62) we have

$$ \hat{\mathbf{x}}_{Uk}^{LSAR} = -\left( A_1^T A_1 \right)^{-1} \left( A_1^T A_2 \right) \mathbf{x}_{Kn} \tag{10.63} $$

The solution in Equation (10.63) gives the vector $\hat{\mathbf{x}}_{Uk}^{LSAR}$, which is the least
square error estimate of the unknown data vector.
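
The two-stage procedure can be sketched as follows, assuming NumPy. The function names `estimate_ar` and `lsar_interpolate` are hypothetical. The first stage uses the "discard" option, i.e. it drops every row of Equation (10.54) that touches a missing sample; the second stage builds $A_1$ and $A_2$ from the prediction-error filter $[1, -\hat{a}_1, \ldots, -\hat{a}_P]$ and applies Equation (10.63). The sketch assumes at least P known samples on each side of the gap.

```python
import numpy as np

def estimate_ar(x, P, k, M):
    """Stage 1: least-squares AR coefficients from equations free of missing samples."""
    rows, targets = [], []
    for m in range(P, len(x)):
        # Equation m uses samples m-P..m; skip it if it overlaps the gap k..k+M-1.
        if m >= k and m - P <= k + M - 1:
            continue
        rows.append(x[m - P:m][::-1])       # x(m-1), ..., x(m-P)
        targets.append(x[m])
    a, *_ = np.linalg.lstsq(np.array(rows), np.array(targets), rcond=None)
    return a

def lsar_interpolate(x, k, M, P):
    """Stage 2: estimate x(k)..x(k+M-1) via Equation (10.63)."""
    a = estimate_ar(x, P, k, M)
    c = np.concatenate(([1.0], -a))         # prediction-error filter
    # Rows: e(k)..e(k+M+P-1). Columns: samples x(k-P)..x(k+M+P-1).
    A = np.zeros((M + P, M + 2 * P))
    for r in range(M + P):
        for j in range(P + 1):
            A[r, r - j + P] = c[j]
    A1 = A[:, P:P + M]                      # columns of the unknown samples
    A2 = np.hstack([A[:, :P], A[:, P + M:]])
    x_kn = np.concatenate([x[k - P:k], x[k + M:k + M + P]])
    return -np.linalg.solve(A1.T @ A1, A1.T @ A2 @ x_kn)
```

For a noise-free signal that obeys an AR model exactly, such as a sum of two sinusoids with P = 4, the excitation is zero and the interpolation should be essentially exact; for real signals the estimate is smoother than the original.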

10.3.4 Interpolation Based on Long-Term and Short-term 
Correlations 

For the best results, a model-based interpolation algorithm should utilise all 
the correlation structures of the signal process, including any periodic 
structures. For example, the main correlation structures in a voiced speech 
signal are the short-term correlation due to the resonance of the vocal tract 
and the long-term correlation due to the quasi-periodic excitation pulses of 
the glottal cords. For voiced speech, interpolation based on the short-term 
correlation does not perform well if the missing samples coincide with an 
underlying quasi-periodic excitation pulse. In this section, the AR 
interpolation is extended to include both long-term and short-term 
correlations. For most audio signals, the short-term correlation of each 
sample with the immediately preceding samples decays exponentially with 
time, and can be usually modelled with an AR model of order 10-20. In 
order to include the pitch periodicities in the AR model of Equation (10.53),
the model order must be greater than the pitch period. For speech signals,
the pitch period is normally in the range 4-20 milliseconds, equivalent to
40-200 samples at a sampling rate of 10 kHz. Implementation of an AR
model of this order is not practical owing to stability problems and
computational complexity.

Figure 10.10 A quasiperiodic waveform. The sample marked “?” is predicted using
P immediate past samples and 2Q+1 samples a pitch period away.

A more practical AR model that includes the effects of the long-term 
correlations is illustrated in Figure 10.10. This modified AR model may be 
expressed by the following equation: 

$$ x(m) = \sum_{k=1}^{P} a_k x(m-k) + \sum_{k=-Q}^{Q} p_k x(m-T-k) + e(m) \tag{10.64} $$

The AR model of Equation (10.64) is composed of a short-term predictor
$\sum a_k x(m-k)$ that models the contribution of the P immediate past samples,
and a long-term predictor $\sum p_k x(m-T-k)$ that models the contribution of
2Q+1 samples a pitch period away. The parameter T is the pitch period; it
can be estimated from the autocorrelation function of x(m) as the time
difference between the peak of the autocorrelation, which is at the
correlation lag zero, and the second largest peak, which should happen a
pitch period away from the lag zero.

Figure 10.11 A signal with M missing samples. P immediate samples each side of
the gap and 2Q+1 samples a pitch period away are used for interpolation.

The AR model of Equation (10.64) is specified by the parameter vector
$\mathbf{c} = [a_1, \ldots, a_P, p_{-Q}, \ldots, p_{+Q}]$ and the pitch period T. Note that in Figure 10.10
the sample marked “?” coincides with the onset of an excitation pulse. This
sample is not well predictable from the P past samples, because they do not
include a pulse event. The sample is more predictable from the 2Q + 1
samples a pitch period away, since they include the effects of a similar
excitation pulse. The predictor coefficients are estimated (see Chapter 7)
using the so-called normal equations:

$$ \mathbf{c} = R_{xx}^{-1} \mathbf{r}_{xx} \tag{10.65} $$

where $R_{xx}$ is the autocorrelation matrix of signal x and $\mathbf{r}_{xx}$ is the correlation
vector. In expanded form, Equation (10.65) can be written as







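$$
\begin{pmatrix}
a_1\\ a_2\\ \vdots\\ a_P\\ p_{-Q}\\ p_{-Q+1}\\ \vdots\\ p_{+Q}
\end{pmatrix}
=
\begin{pmatrix}
r(0) & r(1) & \cdots & r(P-1) & r(T-Q-1) & r(T-Q) & \cdots & r(T+Q-1)\\
r(1) & r(0) & \cdots & r(P-2) & r(T-Q-2) & r(T-Q-1) & \cdots & r(T+Q-2)\\
\vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots\\
r(P-1) & r(P-2) & \cdots & r(0) & r(T-Q-P) & r(T-Q-P+1) & \cdots & r(T+Q-P)\\
r(T-Q-1) & r(T-Q-2) & \cdots & r(T-Q-P) & r(0) & r(1) & \cdots & r(2Q)\\
r(T-Q) & r(T-Q-1) & \cdots & r(T-Q-P+1) & r(1) & r(0) & \cdots & r(2Q-1)\\
\vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots\\
r(T+Q-1) & r(T+Q-2) & \cdots & r(T+Q-P) & r(2Q) & r(2Q-1) & \cdots & r(0)
\end{pmatrix}^{-1}
\begin{pmatrix}
r(1)\\ r(2)\\ \vdots\\ r(P)\\ r(T-Q)\\ r(T-Q+1)\\ \vdots\\ r(T+Q)
\end{pmatrix}
\tag{10.66}
$$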

The modified AR model can be used for interpolation in the same way as 
the conventional AR model described in the previous section. Again, it is 
assumed that within a data window of N speech samples, a segment of M 
samples commencing from the sample point k, $\mathbf{x}_{Uk} = \{x(k), x(k+1), \ldots, x(k+M-1)\}$,
is missing. Figure 10.11 illustrates the interpolation problem.
The missing samples are estimated using P samples in the immediate
vicinity and 2Q+1 samples a pitch period away on each side of the missing
signal. For the signal record of N samples, the modified AR equation
(10.64) can be written in matrix form as

(10.67)

where the subscript Uk denotes the unknown samples. In compact matrix
notation, this set of equations can be written in the form

$$ \mathbf{e}(\mathbf{x}_{Uk}, \mathbf{c}) = \mathbf{x} - X\mathbf{c} \tag{10.68} $$

As in Section 10.3.2, the interpolation problem is solved in two stages: 

(a) In the first stage, the known samples on both sides of the missing 
signal are used to estimate the AR coefficient vector c. 

(b) In the second stage, the AR coefficient estimates are substituted in 
Equation (10.68) so that the only unknowns are the data samples. 

The solution follows the same steps as those described in Section 10.3.2. 
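
The first stage for the extended model can be sketched as below, assuming NumPy. The function names are hypothetical. The pitch period is taken as the lag of the largest autocorrelation value inside a caller-supplied search range (an assumption standing in for "the second largest peak"), and the coefficient vector is obtained by least squares on the regression form of Equation (10.64) rather than by forming the autocorrelation matrix explicitly. The sketch assumes T > Q so that the long-term taps refer to past samples only.

```python
import numpy as np

def pitch_period(x, min_lag, max_lag):
    """Lag of the largest autocorrelation value in [min_lag, max_lag]."""
    x = x - np.mean(x)
    r = np.correlate(x, x, mode="full")[len(x) - 1:]   # r[0] is lag zero
    return min_lag + int(np.argmax(r[min_lag:max_lag + 1]))

def fit_long_short_ar(x, P, Q, T):
    """Least-squares estimate of c = [a_1..a_P, p_-Q..p_+Q] in Equation (10.64)."""
    start = max(P, T + Q)                  # first m with all regressors available
    rows = []
    for m in range(start, len(x)):
        short = [x[m - k] for k in range(1, P + 1)]
        long_term = [x[m - T - k] for k in range(-Q, Q + 1)]
        rows.append(short + long_term)
    X = np.array(rows)
    c, *_ = np.linalg.lstsq(X, x[start:], rcond=None)
    return c, X
```

In an interpolation setting, rows that involve missing samples would be discarded before the fit, exactly as for the short-term model.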

10.3.5 LSAR Interpolation Error 

In this section, we discuss the effects of the signal characteristics, the model 
parameters and the number of unknown samples on the interpolation error. 
The interpolation error v(m), defined as the difference between the original
sample x(m) and the interpolated sample $\hat{x}(m)$, is given by

$$ v(m) = x(m) - \hat{x}(m) \tag{10.69} $$

A common measure of signal distortion is the mean square error distance
defined as

$$ D(\mathbf{c}, M) = \frac{1}{M}\, E\left\{ \sum_{m=0}^{M-1} \left[ x(k+m) - \hat{x}(k+m) \right]^2 \right\} \tag{10.70} $$



where k is the beginning of an M-samples long segment of missing signal, 
and E [.] is the expectation operator. In Equation (10.70), the average 

distortion D is expressed as a function of the number of the unknown 
samples M, and also the model coefficient vector c. In general, the quality 
of interpolation depends on the following factors: 

(a) The signal correlation structure. For deterministic signals such as 
sine waves, the theoretical interpolation error is zero. However 
information-bearing signals have a degree of randomness that makes 
perfect interpolation with zero error an impossible objective. 

(b) The length of the missing segment. The amount of information lost, 
and hence the interpolation error, increase with the number of 
missing samples. Within a sequence of missing samples the error is 
usually largest for the samples in the middle of the gap. The 
interpolation Equation (10.63) becomes increasingly ill-conditioned 
as the length of the missing samples increases. 

(c) The nature of the excitation underlying the missing samples. The 
LSAR interpolation cannot account for any random excitation 
underlying the missing samples. In particular, the interpolation 
quality suffers when the missing samples coincide with the onset of 
an excitation pulse. In general, the least square error criterion causes 
the interpolator to underestimate the energy of the underlying 
excitation signal. The inclusion of long-term prediction and the use 
of quasi-periodic structure of signals improves the ability of the 
interpolator to restore the missing samples. 

(d) AR model order and the method used for estimation of the AR 
coefficients. The interpolation error depends on the AR model order. 
Usually a model order of 2-3 times the length of the missing data
sequence achieves a good result.

Figure 10.12 (a) A section of speech showing interpolation of 60 samples
starting from the sample point 100. (b) Interpolation using short and long-term
correlations. Interpolated samples are shown by the light shaded line.

Figure 10.13 (a) A section of speech showing interpolation of 50 samples
starting from the sample point 175. (b) Interpolation using short and long-term
correlations. Interpolated samples are shown by the light shaded line.



The interpolation error also depends on how well the AR parameters 
can be estimated from the incomplete data. In Equation (10.54), in the first 
stage of the solution, where the AR coefficients are estimated, two different 
approaches may be employed to linearise the system of equations. In the 
first approach all equations, between the dashed lines, that involve 
nonlinear terms are discarded. This approach has the advantage that no 
assumption is made about the missing samples. In fact, from a signal-ensemble
point of view, the effect of discarding some equations is
equivalent to that of having a smaller signal record. In the second method,
starting from an initial estimate of the unknown vector (such as $\mathbf{x}_{Uk} = \mathbf{0}$),
Equation (10.54) is solved to obtain the AR parameters. The AR
coefficients are then used in the second stage of the algorithm to estimate 
the unknown samples. These estimates may be improved in further 
iterations of the algorithm. The algorithm usually converges after one or 
two iterations. 

Figures 10.12 and 10.13 show the results of application of the least 
square error AR interpolation method to speech signals. The interpolated 
speech segments were chosen to coincide with the onset of an excitation 
pulse. In these experimental cases the original signals are available for 
comparison. Each signal was interpolated by the AR model of Equation 
(10.53) and also by the extended AR model of Equation (10.64). The length 
of the conventional linear predictor model was set to 20. The modified 
linear AR model of Equation (10.64) has a prediction order of (20,7); that 
is, the short-term predictor has 20 coefficients and the long-term predictor 
has 7 coefficients. The figures clearly demonstrate that the modified AR 
model that includes the long-term as well as the short-term correlation 
structures outperforms the conventional AR model. 

10.3.6 Interpolation in Frequency-Time Domain 

Time-domain, AR model-based interpolation methods are effective for the
interpolation of a relatively short length of samples (say less than 100
samples at a 20 kHz sampling rate), but suffer severe performance
degradations when used for interpolation of a long sequence of samples. This
is partly due to the numerical problems associated with the inversion of a
large matrix, involved in the time-domain interpolation of a large number of
samples, Equation (10.63).

Spectral-time representation provides a useful form for the interpolation of 
a large gap of missing samples. For example, through discrete Fourier 
transformation (DFT) and spectral-time representation of a signal, the 
problem of interpolation of a gap of N samples in the time domain can be 
converted into the problem of interpolation of a gap of one sample, along 
the time, in each of N discrete frequency bins, as explained next. 

Spectral-Time Representation with STFT 

A relatively simple and practical method for spectral-time representation of 
a signal is the short-time Fourier transform (STFT) method. To construct a
two-dimensional STFT from a one-dimensional function of time x(m), the
input signal is segmented into overlapping blocks of N samples, as
illustrated in Figure 10.14. Each block is windowed, prior to discrete
Fourier transformation, to reduce the spectral leakage due to the effects of
discontinuities at the edges of the block. The frequency spectrum of the m-th
signal block is given by the discrete Fourier transform as

$$ X(k, m) = \sum_{i=0}^{N-1} w(i)\, x(m(N-D)+i)\, e^{-j\frac{2\pi}{N} ik}, \qquad k = 0, \ldots, N-1 \tag{10.71} $$

where X(k,m) is a spectral-time representation with time index m and
frequency index k, N is the number of samples in each block, and D is the
block overlap. In STFT, it is assumed that the signal frequency composition
is time-invariant within the duration of each block, but it may vary across
the blocks. In general, the k-th spectral component of a signal has a
time-varying character, i.e. it is “born”, evolves for some time, disappears, and
then reappears with a different intensity and different characteristics.
Figure 10.15 illustrates a spectral-time signal with a missing block of
samples. The aim of interpolation is to fill in the signal gap such that, at the
beginning and at the end of the gap, the continuity of both the magnitude
and the phase of each frequency component of the signal is maintained. For
most time-varying signals (such as speech), a low-order polynomial
interpolator of the magnitude and the phase of the DFT components of the
signal, making use of the few adjacent blocks on either side of the gap,
would produce satisfactory results.

Figure 10.14 Illustration of segmentation of a signal (with a missing gap) for
spectral-time representation.

Figure 10.15 Spectral-time representation of a signal with a missing gap.

Figure 10.16 Configuration of a digital oscillator.
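
A direct transcription of Equation (10.71) is sketched below, assuming NumPy, with a Hann window as an assumed choice for w(i). The function names are hypothetical. The second helper shows only the simplest case of the interpolation idea, a first-order (linear) interpolation of the magnitude of one missing block from its two neighbours; phase interpolation, which requires phase unwrapping along time, is deliberately left out of this sketch.

```python
import numpy as np

def stft(x, N, D, window=None):
    """X[k, m] of Equation (10.71): blocks of N samples with overlap D."""
    w = np.hanning(N) if window is None else np.asarray(window)
    hop = N - D                                   # block m starts at m(N - D)
    n_blocks = 1 + (len(x) - N) // hop
    i = np.arange(N)
    dft = np.exp(-2j * np.pi * i[:, None] * i / N)    # dft[k, i]
    X = np.empty((N, n_blocks), dtype=complex)
    for m in range(n_blocks):
        X[:, m] = dft @ (w * x[m * hop:m * hop + N])
    return X

def interpolate_magnitude(X, m_gap):
    """Replace the magnitudes of block m_gap by the mean of its two neighbours."""
    mag = np.abs(X)
    mag[:, m_gap] = 0.5 * (mag[:, m_gap - 1] + mag[:, m_gap + 1])
    return mag
```

In practice the explicit DFT matrix would be replaced by an FFT; it is written out here only to mirror the equation term by term.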



10.3.7 Interpolation Using Adaptive Code Books 

In the LSAR interpolation method, described in Section 10.3.2, the signals 
are modelled as the output of an AR model excited by a random input. 
Given enough samples, the AR coefficients can be estimated with 
reasonable accuracy. However, the instantaneous values of the random 
excitation during the periods when the signal is missing cannot be 
recovered. This leads to a consistent underestimation of the amplitude and 
the energy of the interpolated samples. One solution to this problem is to 
use a zero-input signal model. Zero-input models are feedback oscillator 
systems that produce an output signal without requiring an input. 

The general form of the equation describing a digital nonlinear 
oscillator can be expressed as 

$$ x(m) = g_f(x(m-1), x(m-2), \ldots, x(m-P)) \tag{10.72} $$

The mapping function $g_f(\cdot)$ may be a parametric or a non-parametric
mapping. The model in Equation (10.72) can be considered as a nonlinear
predictor, and the subscript f denotes forward prediction based on the past
samples.

A parametric model of a nonlinear oscillator can be formulated using a 
Volterra filter model. However, in this section, we consider a non- 
parametric method for its ease of formulation and stable characteristics. 
Kubin and Kleijn (1994) have described a non-parametric oscillator based
on a codebook model of the signal process.

In this method, each entry in the codebook has P+1 samples, where the
(P+1)th sample is intended as an output. Given P input samples x=[x(m-1),
..., x(m-P)], the codebook output is the (P+1)th sample of the vector in the
codebook whose first P samples have a minimum distance from the input
signal x. For a signal record of length N samples, a codebook of size N-P
vectors can be constructed by dividing the signal into overlapping segments
of P+1 samples with the successive segments having an overlap of P
samples. Similarly, a backward oscillator can be expressed as

$$ x_b(m) = g_b(x(m+1), x(m+2), \ldots, x(m+P)) \tag{10.73} $$



As in the case of a forward oscillator, the backward oscillator can be
designed using a non-parametric method based on an adaptive codebook of
the signal process. In this case each entry in the codebook has P+1 samples,
where the first sample is intended as an output sample. Given P input
samples x=[x(m+1), ..., x(m+P)], the codebook output is the first sample of
the codebook vector whose next P samples have a minimum distance from
the input signal x.
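
The codebook oscillators described above can be sketched in a few lines of
Python. This is only an illustrative sketch, not the implementation of the
cited authors: the function names are hypothetical, the distance measure is
assumed to be the squared Euclidean distance, and the input vectors are held
in time order (oldest sample first) rather than in the reversed order used in
the text.

```python
def build_codebook(signal, P):
    """Divide a record of N samples into N-P overlapping vectors of P+1 samples."""
    return [signal[i:i + P + 1] for i in range(len(signal) - P)]


def _sq_dist(a, b):
    # Squared Euclidean distance (an assumed choice of distance measure).
    return sum((u - v) ** 2 for u, v in zip(a, b))


def forward_predict(codebook, past):
    """past = [x(m-P), ..., x(m-1)] in time order; returns an estimate of x(m)."""
    best = min(codebook, key=lambda v: _sq_dist(v[:-1], past))
    return best[-1]          # the (P+1)th sample is the output


def backward_predict(codebook, future):
    """future = [x(m+1), ..., x(m+P)] in time order; returns an estimate of x(m)."""
    best = min(codebook, key=lambda v: _sq_dist(v[1:], future))
    return best[0]           # the first sample is the output


def run_forward(codebook, history, M):
    """Free-run the forward oscillator for M samples, feeding outputs back."""
    P = len(codebook[0]) - 1
    buf = list(history[-P:])
    out = []
    for _ in range(M):
        nxt = forward_predict(codebook, buf)
        out.append(nxt)
        buf = buf[1:] + [nxt]   # zero-input model: the output becomes the next input
    return out
```

Because each output is fed back as an input, the oscillator needs no
excitation signal; a backward free-run can be written in the same way by
prepending the outputs of backward_predict to the buffer.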

For interpolation of M missing samples, the outputs of the forward and
backward nonlinear oscillators may be combined as



x(k+m) = \left(1 - \frac{m}{M-1}\right) x_f(k+m) + \left(\frac{m}{M-1}\right) x_b(k+m)    (10.74)



where it is assumed that the missing samples start at k. 
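
As a minimal sketch of this combination, the following hypothetical function
cross-fades the two oscillator outputs, assuming linear weights that move
from the forward estimate at the start of the gap (m=0) to the backward
estimate at its end (m=M-1). The handling of a single missing sample is an
added assumption, since the weights are undefined for M=1.

```python
def crossfade(x_f, x_b):
    """Combine forward and backward estimates of M missing samples.

    x_f[m] and x_b[m] are the forward and backward estimates of sample k+m.
    """
    M = len(x_f)
    if len(x_b) != M:
        raise ValueError("forward and backward estimates must have equal length")
    if M == 1:
        # Assumption: with a single missing sample, average the two estimates.
        return [0.5 * (x_f[0] + x_b[0])]
    out = []
    for m in range(M):
        w_b = m / (M - 1)        # weight of the backward estimate
        w_f = 1.0 - w_b          # weight of the forward estimate
        out.append(w_f * x_f[m] + w_b * x_b[m])
    return out
```

The weights sum to one for every m, so the interpolated segment follows the
forward estimate near the past samples and the backward estimate near the
future samples.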



10.3.8 Interpolation Through Signal Substitution 

Audio signals often have a time-varying but quasi-periodic repetitive
structure. Therefore most acoustic events in a signal record recur with
some variations. This observation forms the basis for interpolation through
pattern matching, where a missing segment of a signal is substituted by the
best match from a signal record. Consider a relatively long signal record of
N samples, with a gap of M missing samples at its centre. A section of the
signal with the gap in the middle can be used to search for the best-match
segment in the record. The missing samples are then substituted by the
corresponding section of the best-match signal. This interpolation method is
particularly useful when the length of the missing signal segment is large.
For a given class of signals, we may be able to construct a library of
patterns for use in waveform substitution (Bogner, 1989).
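
The search described above can be sketched as follows. This is an
illustrative sketch under stated assumptions, not a reference
implementation: the function name, the context length L on each side of the
gap, and the squared-error matching criterion are all hypothetical choices.

```python
def substitute_gap(record, gap_start, M, L):
    """Fill M missing samples starting at gap_start by waveform substitution.

    L known samples on each side of the gap form the search template.
    Candidate segments that overlap the gap are excluded from the search.
    """
    if gap_start < L or gap_start + M + L > len(record):
        raise ValueError("not enough context around the gap")
    template = (record[gap_start - L:gap_start]
                + record[gap_start + M:gap_start + M + L])
    span = 2 * L + M
    best_pos, best_err = None, float("inf")
    for s in range(len(record) - span + 1):
        if s < gap_start + M and s + span > gap_start:
            continue                      # candidate overlaps the missing samples
        candidate = record[s:s + L] + record[s + L + M:s + span]
        err = sum((a - b) ** 2 for a, b in zip(candidate, template))
        if err < best_err:
            best_pos, best_err = s, err
    if best_pos is None:
        raise ValueError("record too short to find a candidate segment")
    restored = list(record)
    # Substitute the gap with the middle section of the best-match segment.
    restored[gap_start:gap_start + M] = record[best_pos + L:best_pos + L + M]
    return restored
```

In practice the best-match section may need to be level-adjusted or
cross-faded at its edges to avoid discontinuities; that refinement is not
shown here.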



10.4 Summary 

Interpolators, in their various forms, are used in most signal processing 
applications. The obvious example is the estimation of a sequence of 
missing samples. However, the use of an interpolator covers a much wider 
range of applications, from low-bit-rate speech coding to pattern 
recognition and decision making systems. We started this chapter with a 
study of the ideal interpolation of a band-limited signal, and its applications 
in digital-to-analog conversion and in multirate signal processing. In this 
chapter, various interpolation methods were categorised and studied in two 
different sections: one on polynomial interpolation, which is the more 
traditional numerical computing approach, and the other on statistical 
interpolation, which is the digital signal processing approach. 

The general form of the polynomial interpolator was formulated and its
special forms, the Lagrange, Newton, Hermite and cubic spline interpolators,
were considered. The polynomial methods are not equipped to make
optimal use of the predictive and statistical structures of the signal, and are
impractical for interpolation of a relatively large number of samples. A
number of useful statistical interpolators were studied. These include
maximum a posteriori interpolation, least square error AR interpolation,
frequency-time interpolation, and an adaptive codebook interpolator.
Model-based interpolation using an autoregressive model is
satisfactory for most audio applications so long as the number of missing
samples is not too large. For interpolation of a relatively large number of
samples, the frequency-time interpolation method and the adaptive
codebook method are more suitable.

11.1 Spectral Subtraction

11.2 Processing Distortions 

11.3 Non-Linear Spectral Subtraction 

11.4 Implementation of Spectral Subtraction 

11.5 Summary 



Spectral subtraction is a method for restoration of the power spectrum
or the magnitude spectrum of a signal observed in additive noise, 
through subtraction of an estimate of the average noise spectrum from 
the noisy signal spectrum. The noise spectrum is usually estimated, and 
updated, from the periods when the signal is absent and only the noise is 
present. The assumption is that the noise is a stationary or a slowly varying 
process, and that the noise spectrum does not change significantly in
between the update periods. For restoration of time-domain signals, an
estimate of the instantaneous magnitude spectrum is combined with the 
phase of the noisy signal, and then transformed via an inverse discrete 
Fourier transform to the time domain. In terms of computational 
complexity, spectral subtraction is relatively inexpensive. However, owing 
to random variations of noise, spectral subtraction can result in negative 
estimates of the short-time magnitude or power spectrum. The magnitude 
and power spectrum are non-negative variables, and any negative estimates 
of these variables should be mapped into non-negative values. This non- 
linear rectification process distorts the distribution of the restored signal. 
The processing distortion becomes more noticeable as the signal-to-noise 
ratio decreases. In this chapter, we study spectral subtraction, and the 
different methods of reducing and removing the processing distortions. 

11.1 Spectral Subtraction

In applications where, in addition to the noisy signal, the noise is accessible 
on a separate channel, it may be possible to retrieve the signal by subtracting 
an estimate of the noise from the noisy signal. For example, the adaptive 
noise canceller of Section 1.3.1 takes as the inputs the noise and the noisy 
signal, and outputs an estimate of the clean signal. However, in many 
applications, such as at the receiver of a noisy communication channel, the 
only signal that is available is the noisy signal. In these situations, it is not 
possible to cancel out the random noise, but it may be possible to reduce the 
average effects of the noise on the signal spectrum. The effect of additive 
noise on the magnitude spectrum of a signal is to increase the mean and the 
variance of the spectrum as illustrated in Figure 11.1. The increase in the 
variance of the signal spectrum results from the random fluctuations of the 
noise, and cannot be cancelled out. The increase in the mean of the signal 
spectrum can be removed by subtraction of an estimate of the mean of the 
noise spectrum from the noisy signal spectrum. The noisy signal model in 
the time domain is given by 



y(m) = x(m) + n(m)    (11.1)




Figure 11.1 Illustrations of the effect of noise on a signal in the time and the
frequency domains.

where x(m), n(m) and y(m) are the signal, the additive noise and the noisy
signal respectively, and m is the discrete time index. In the frequency
domain, the noisy signal model of Equation (11.1) is expressed as

Y(f)=X(f)+N(f) (11.2) 

where Y(f), X(f) and N(f) are the Fourier transforms of the noisy signal y(m),
the original signal x(m) and the noise n(m) respectively, and f is the
frequency variable. In spectral subtraction, the incoming noisy signal y(m) is
buffered and divided into segments of N samples. Each segment is
windowed, using a Hanning or a Hamming window, and then transformed
via the discrete Fourier transform (DFT) to N spectral samples. The windows
alleviate the effects of the discontinuities at the endpoints of each segment.
The windowed signal is given by

y_w(m) = w(m) y(m)
       = w(m) [x(m) + n(m)]    (11.3)
       = x_w(m) + n_w(m)

The windowing operation can be expressed in the frequency domain as 

Y_w(f) = W(f) * Y(f)
       = X_w(f) + N_w(f)    (11.4)



where the operator * denotes convolution. Throughout this chapter, it is 
assumed that the signals are windowed, and hence for simplicity we drop 
the use of the subscript w for windowed signals. 
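
The buffering and windowing step can be sketched as below. This is a minimal
sketch with hypothetical function names; the symmetric window definitions and
the hop size between successive segments are assumptions, since the amount of
segment overlap is not specified at this point in the text.

```python
import math


def hann(N):
    """Symmetric Hanning (Hann) window of N >= 2 samples."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * m / (N - 1)) for m in range(N)]


def hamming(N):
    """Symmetric Hamming window of N >= 2 samples."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * m / (N - 1)) for m in range(N)]


def windowed_frames(y, N, hop, window=hann):
    """Divide y into segments of N samples and apply y_w(m) = w(m) y(m).

    hop is the (assumed) shift in samples between successive segments;
    trailing samples that do not fill a whole segment are dropped.
    """
    if N < 2 or hop < 1:
        raise ValueError("need N >= 2 and hop >= 1")
    w = window(N)
    return [[w[m] * y[s + m] for m in range(N)]
            for s in range(0, len(y) - N + 1, hop)]
```

Because windowing is a sample-by-sample multiplication, it is linear:
windowing the noisy signal gives the same result as windowing the signal and
the noise separately and adding them, which is the content of Equation (11.3).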

Figure 11.2 illustrates a block diagram configuration of the spectral
subtraction method. A more detailed implementation is described in Section
11.4. The equation describing spectral subtraction may be expressed as



|\hat{X}(f)|^b = |Y(f)|^b - \alpha |\bar{N}(f)|^b    (11.5)

where |\hat{X}(f)|^b is an estimate of the original signal spectrum |X(f)|^b, and
|\bar{N}(f)|^b is the time-averaged noise spectrum. It is assumed that the noise is a
wide-sense stationary random process. For magnitude spectral subtraction,
the exponent b=1, and for power spectral subtraction, b=2. The parameter \alpha
in Equation (11.5) controls the amount of noise subtracted from the noisy
signal. For full noise subtraction, \alpha=1, and for over-subtraction, \alpha>1.

Figure 11.2 A block diagram illustration of spectral subtraction.

The time-averaged noise spectrum is obtained from the periods when the signal
is absent and only the noise is present as

|\bar{N}(f)|^b = \frac{1}{K} \sum_{i=0}^{K-1} |N_i(f)|^b    (11.6)

In Equation (11.6), |N_i(f)| is the spectrum of the ith noise frame, and it is
assumed that there are K frames in a noise-only period, where K is a
variable. Alternatively, the averaged noise spectrum can be obtained as the
output of a first order digital low-pass filter as



|\bar{N}_i(f)|^b = \rho |\bar{N}_{i-1}(f)|^b + (1-\rho) |N_i(f)|^b    (11.7)

where the low-pass filter coefficient \rho is typically set between 0.85 and
0.99. For restoration of a time-domain signal, the magnitude spectrum
estimate |\hat{X}(f)| is combined with the phase of the noisy signal, and then
transformed into the time domain via the inverse discrete Fourier transform
as

\hat{x}(m) = \sum_{k=0}^{N-1} |\hat{X}(k)| e^{j\theta_Y(k)} e^{j\frac{2\pi}{N}km}    (11.8)

where \theta_Y(k) is the phase of the noisy signal spectrum Y(k). The signal
restoration equation (11.8) is based on the assumption that the audible noise
is mainly due to the distortion of the magnitude spectrum, and that the phase
distortion is largely inaudible. Evaluations of the perceptual effects of
simulated phase distortions validate this assumption.
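
The steps of Equations (11.5) to (11.8) can be put together in a short sketch
for a single windowed frame. The code below is illustrative only: the function
names are hypothetical, a direct O(N^2) DFT is used to keep the example
self-contained, the 1/N scaling is placed in the inverse transform, and
negative estimates are simply set to zero, which is one possible choice of
the non-negative mapping mentioned in the chapter introduction.

```python
import cmath
import math


def dft(frame):
    """Direct DFT of a frame of N samples (unnormalised)."""
    N = len(frame)
    return [sum(frame[m] * cmath.exp(-2j * math.pi * k * m / N) for m in range(N))
            for k in range(N)]


def idft(spectrum):
    """Inverse DFT with 1/N scaling; returns the real part of each sample."""
    N = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * math.pi * k * m / N)
                 for k in range(N)) / N).real
            for m in range(N)]


def average_noise(noise_frames, b=2):
    """Equation (11.6): average of |N_i(f)|^b over K noise-only frames."""
    K = len(noise_frames)
    spectra = [[abs(c) ** b for c in dft(frame)] for frame in noise_frames]
    return [sum(column) / K for column in zip(*spectra)]


def update_noise(prev_avg, noise_frame, rho=0.9, b=2):
    """Equation (11.7): first order low-pass update of the noise estimate."""
    current = [abs(c) ** b for c in dft(noise_frame)]
    return [rho * p + (1.0 - rho) * c for p, c in zip(prev_avg, current)]


def spectral_subtract(frame, noise_avg, alpha=1.0, b=2):
    """Equations (11.5) and (11.8) applied to one windowed noisy frame."""
    restored = []
    for Y_k, N_k in zip(dft(frame), noise_avg):
        diff = abs(Y_k) ** b - alpha * N_k
        # Assumption: negative estimates are mapped to zero.
        magnitude = max(diff, 0.0) ** (1.0 / b)
        # Combine the magnitude estimate with the phase of the noisy signal.
        restored.append(magnitude * cmath.exp(1j * cmath.phase(Y_k)))
    return idft(restored)
```

With a noise estimate of zero the function returns the input frame
unchanged, which is a convenient sanity check; with alpha greater than 1 more
spectral bins are driven to zero, at the cost of additional signal distortion.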





